Open piskvorky opened 9 years ago
This was planned for last month, but the deadline slipped because of other projects that had priority. My own plan is to release a bare-bones version in the second week of January.
Great, thanks Lars. Please ping me when you think it's ready (not sure what bare-bones means, perhaps that's enough).
Bare-bones means we can semanticize with baseline metrics. No count-min sketch needed (you only need that for fitting more complicated models on the output of semanticizest).
Hi @piskvorky, I think what we have now is ready for beta-testing. Would you like to have a try?
I wanted to merge #22 before releasing as beta, but it needs tests and I'm not going to postpone further. The current functionality should be close to what the old semanticizer could do.
Excellent, thanks! I'll check the situation with the client and report back.
By the way, I remember we trained an extra model with David Graus, on Yahoo! queries. Is a similar thing possible here? Or does it not make sense? CC @graus .
There's no re-ranking model in here. If you want that, you'd need to stack a model on top.
(Which I will definitely work on if it does not magically appear -- I don't recall [if/where] this item ended up in terms of roadmap. Nor if there's a roadmap.)
The idea was to provide all the information necessary for feature extraction in such a model. The thing is that we can't ship models, training data or anything, so we can't test this stuff and it will go stale.
Thanks guys.
One question: skimming the docs, I can't see an obvious answer (I'll check the API in more detail later). I remember we had some disambiguation issues and needed semanticizer to return "unnormalized" statistics too, for some local post-filtering of results. We ended up adding something like `link['unnormed'] = self.wpm.get_sense_data(ngram, sense_str)` to our fork of semanticizer.
Is this already included in semanticizest, or will we have to add it again ourselves (+ pull request)? Or does this question not even make sense for the new code/approach?
We're not returning that, but we should. There's an XXX in `semanticizest/_semanticizer.py` where it needs to be added...
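For illustration only, here is a minimal sketch of what "returning unnormalized statistics alongside the normalized score" could look like. The names `Candidate`, `link_count` and `sense_count` are mine, not semanticizest's actual API:

```python
from collections import namedtuple

# Hypothetical sketch: expose raw counts next to the normalized
# commonness probability, so downstream code can post-filter locally.
Candidate = namedtuple("Candidate", ["target", "prob", "link_count", "sense_count"])

def candidates_with_raw_stats(anchor_stats, ngram):
    """Return candidates with both normalized and raw statistics.

    anchor_stats maps an n-gram to {target: (link_count, sense_count)}.
    """
    senses = anchor_stats.get(ngram, {})
    total = sum(lc for lc, _ in senses.values())
    return [Candidate(t, lc / total if total else 0.0, lc, sc)
            for t, (lc, sc) in senses.items()]
```

The point is only that nothing needs to be thrown away: if raw counts are in the store, both numbers can be returned per sense.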
Speaking of shipping models -- does anyone have an English model and wouldn't mind sharing it with me? I've tried building it myself, but my laptop crashes after processing 3920000 articles. ;-(
I'm building one right now, ready in three hours. Ping me by email tomorrow morning.
Cool thanks :-)
We ran some initial checks, created the EN model, and I want to make sure I understand correctly:
semanticizEST only has a single API method, `all_candidates(tokens)`. This returns all candidates for the given tokens, with no context and no disambiguation.
There's no API like in semanticizER where we send in a text (string) and it gives us back detected entities.
Is that correct?
Are you planning on extending the pipeline? What is the timeline for reaching ± semanticizER functionality? The README says that is the goal.
Thanks! CC @tgalery @graus @larsmans
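To make the question concrete, here is a toy stand-in for what "all candidates for the given tokens, no context, no disambiguation" means. This mimics the described `all_candidates(tokens)` interface but is not the real semanticizest implementation:

```python
# Toy sketch: enumerate every n-gram of the input that is a known anchor
# text and return all of its candidate senses, without ranking them by
# context. anchor_stats is a hypothetical {ngram: [(target, prob), ...]}.
def all_candidates(tokens, anchor_stats, max_n=3):
    """Return (start, end, target, prob) for every matching n-gram."""
    out = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            ngram = " ".join(tokens[i:i + n])
            for target, prob in anchor_stats.get(ngram, []):
                out.append((i, i + n, target, prob))
    return out
```

Disambiguation (picking one sense per span, or one span per region of text) would be a separate step on top of this output.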
Ping @graus @larsmans: clarification of the project goals would be welcome. Please let us know what the status is.
Thanks a lot!
The status is that we have a working replacement at http://github.com/semanticize/st that is only lacking:
I plan to have this finished this week (and I'm working on Saturday). You're welcome to test this new version. People at UvA are already using it.
Thanks Lars.
@tgalery can you keep an eye on this? Once we know how to apply semanticizest at a level where it can replace semanticizer (=API for linking entities from plain text), let's evaluate.
Will do @piskvorky !
REST API now works, simple disambiguation in the works but not yet finished.
@tgalery The package is now ready for beta-testing, AFAIC.
Thanks @larsmans I will have a go when I have the time.
Hi @larsmans @piskvorky, I finally had time to take a look at this. I have been playing with the Danish model, and it seems that the st project has pretty much the same functionality as the semanticizest project. The only difference is that, instead of a single endpoint where you get all the candidates, one can also get the candidates matching a best path and an exact match of the string (as documented here: https://github.com/semanticize/st/blob/68465fe840a6087698df8963af5980373c5cedb4/cmd/semanticizest/webserver.go). Although this is functionality added on top of semanticizest, there seems to be no spotter per se (i.e. something that determines which surface forms in the whole text are worth extracting candidates from), nor any robust incorporation of context. Am I right? Are there any plans to incorporate those in the project?
Candidate entities are determined by semanticizest itself; this is a consequence of the hash representation. There are also no context features. We don't have plans to add them, but if @dodijk agrees that we're missing them, they could be added.
The plan was to have semanticizest do basic entity linking, and do it fast, without too many dependencies in terms of training sets, with enough useful output for downstream code to improve its results.
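To unpack "candidate entities are determined by semanticizest itself; this is a consequence of the hash representation": a minimal sketch, assuming anchors are stored keyed only by a hash of the n-gram. The class and method names here are mine, not the real implementation:

```python
import hashlib

def _h(ngram):
    # Store a short digest of the anchor text instead of the string itself.
    return hashlib.md5(ngram.encode("utf-8")).hexdigest()[:16]

class HashedAnchors:
    """Hypothetical hash-keyed anchor store."""

    def __init__(self):
        self.table = {}  # hash -> [(target, count), ...]

    def add(self, ngram, target, count):
        self.table.setdefault(_h(ngram), []).append((target, count))

    def lookup(self, ngram):
        # We cannot enumerate the stored anchors (only hashes are kept);
        # we can only probe with n-grams that occur in the input text.
        return self.table.get(_h(ngram), [])
```

This is why there is no separate spotter: any input n-gram whose hash hits the table becomes a candidate span, and anything else is invisible, so spotting falls out of candidate lookup itself.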
Hello gentlemen,
what is the expected timeline for releasing semanticizest?
I have a client eager to try it out (because semanticizer is so slow, it's a bottleneck for them).
So I'm wondering whether there's a way to share (human) resources: maybe they could do some beta testing and benchmarking on live data, as soon as you declare the new semanticizest production-ready?