semanticize / semanticizest

Standalone Semanticizer
Apache License 2.0

Timeline & beta testing #24

Open piskvorky opened 9 years ago

piskvorky commented 9 years ago

Hello gentlemen,

what is the expected timeline for releasing semanticizest?

I have a client eager to try it out (because semanticizer is so slow, it's a bottleneck for them).

So I'm wondering whether there's a way to share (human) resources -- maybe they could do some beta testing and benchmarking on live data, as soon as you proclaim the new semanticizest production-ready?

larsmans commented 9 years ago

This was planned for last month, but the deadline slipped because of other projects that had priority. My own plan is to release a bare-bones version in the second week of January.

piskvorky commented 9 years ago

Great, thanks Lars. Please ping me when you think it's ready (not sure what bare-bones means, perhaps that's enough).

larsmans commented 9 years ago

Bare-bones means we can semanticize with baseline metrics. No count-min sketch needed (you only need that for fitting more complicated models on the output of semanticizest).
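(For reference, a count-min sketch is a small fixed-size table of counters that may overestimate frequencies but never underestimates them. A minimal sketch of the data structure -- this is just an illustration of the concept, not semanticizest's implementation; the width, depth, and hashing scheme are arbitrary choices:)

```python
import hashlib

class CountMinSketch:
    """Approximate frequency counter in fixed memory.
    Estimates are upper bounds: never below the true count."""

    def __init__(self, width=1000, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _buckets(self, item):
        # One independent hash per row, derived by salting blake2b.
        for row in range(self.depth):
            h = hashlib.blake2b(item.encode(), salt=bytes([row])).digest()
            yield row, int.from_bytes(h[:8], "big") % self.width

    def add(self, item, count=1):
        for row, col in self._buckets(item):
            self.table[row][col] += count

    def estimate(self, item):
        # The minimum over rows limits the damage from hash collisions.
        return min(self.table[row][col] for row, col in self._buckets(item))
```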

larsmans commented 9 years ago

Hi @piskvorky, I think what we have now is ready for beta-testing. Would you like to have a try?

I wanted to merge #22 before releasing a beta, but it needs tests and I'm not going to postpone any further. The current functionality should be close to what the old semanticizer could do.

piskvorky commented 9 years ago

Excellent, thanks! Will check the situation with client and report back.

By the way, I remember we trained an extra model with David Graus, on Yahoo! queries. Is a similar thing possible here? Or does it not make sense? CC @graus .

larsmans commented 9 years ago

There's no re-ranking model in here. If you want that, you'd need to stack a model on top.
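(A toy illustration of what "stacking a model on top" could look like: re-score the candidate list with extra features before picking a winner. The feature and weights below are invented for illustration; semanticizest itself ships no such model:)

```python
def rerank(candidates, context_tokens):
    """Toy reranker stacked on candidates-only output: combine the prior
    link probability with a crude context-overlap feature.
    `candidates` is a list of (target_title, prior_probability) pairs."""
    context = {t.lower() for t in context_tokens}

    def score(cand):
        target, prior = cand
        # Words of the target title, e.g. "Jaguar_Cars" -> {"jaguar", "cars"}.
        target_words = {w.lower() for w in target.replace("_", " ").split()}
        return prior + len(target_words & context)

    return sorted(candidates, key=score, reverse=True)
```

A real reranker would learn the weights from labeled data instead of hard-coding them.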

graus commented 9 years ago

(Which I will definitely work on if it does not magically appear -- I don't recall [if/where] this item ended up in terms of roadmap. Nor if there's a roadmap.)

larsmans commented 9 years ago

The idea was to provide all the information necessary for feature extraction in such a model. The thing is that we can't ship models, training data or anything, so we can't test this stuff and it will go stale.

piskvorky commented 9 years ago

Thanks guys.

One question: skimming the docs, I can't see an obvious answer (I'll have to check the API in more detail later). I remember we had some disambiguation issues and needed semanticizer to return "unnormalized" statistics too, for some local post-filtering of results.

We ended up adding something like `link['unnormed'] = self.wpm.get_sense_data(ngram, sense_str)` to our fork of semanticizer.

Is this already included in semanticizest, or will we have to add it again ourselves (+ pull request)? Or does this question not even make sense for the new code/approach?
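(To illustrate why the raw counts matter for that kind of post-filtering -- the sense names and numbers below are invented, not real model output:)

```python
# Hypothetical raw sense counts for one anchor text; purely illustrative.
sense_counts = {"Jaguar_Cars": 120, "Jaguar": 45, "Jaguar_(band)": 2}

total = sum(sense_counts.values())
candidates = [
    {"sense": sense, "unnormed": count, "prob": count / total}
    for sense, count in sense_counts.items()
]

# Normalized probabilities alone can't distinguish a 1-out-of-2 split from
# a 100-out-of-200 split; raw counts let a client drop low-support senses.
filtered = [c for c in candidates if c["unnormed"] >= 10]
```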

larsmans commented 9 years ago

We're not returning that, but we should. There's an XXX in semanticizest/_semanticizer.py where it needs to be added...

c-martinez commented 9 years ago

Speaking of shipping models -- does anyone have an English model and wouldn't mind sharing it with me? I've tried building it myself, but my laptop crashes after processing 3920000 articles. ;-(

larsmans commented 9 years ago

I'm building one right now, ready in three hours. Ping me by email tomorrow morning.

c-martinez commented 9 years ago

Cool thanks :-)

piskvorky commented 9 years ago

We ran some initial checks, created the EN model, and I want to make sure I understand correctly:

semanticizEST only has a single API method, `all_candidates(tokens)`. This returns all candidates for the given tokens, with no context and no disambiguation.
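(For concreteness, here is a toy sketch of what such a candidates-only API does -- the link statistics and entity names below are invented, not semanticizest's actual data or code:)

```python
# Toy link statistics: surface n-gram -> [(target_title, probability), ...].
# Entries are invented for illustration.
LINK_MODEL = {
    ("new", "york"): [("New_York_City", 0.9), ("New_York_(state)", 0.1)],
    ("york",): [("York", 0.8), ("New_York_City", 0.2)],
}

def all_candidates(tokens, max_n=2):
    """Return every (start, end, target, prob) whose surface n-gram occurs
    in the model -- no context, no disambiguation, no spotting."""
    out = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            ngram = tuple(t.lower() for t in tokens[i:i + n])
            for target, prob in LINK_MODEL.get(ngram, []):
                out.append((i, i + n, target, prob))
    return out
```

Every matching span is reported, including overlapping ones; choosing among them is left to the caller.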

There's no API like in semanticizER where we send in a text (string) and it gives us back detected entities.

Is that correct?

Are you planning to extend the pipeline? What is the timeline for reaching ± semanticizER functionality? The README says that is the goal.

Thanks! CC @tgalery @graus @larsmans

piskvorky commented 9 years ago

Ping @graus @larsmans: clarification of the project goals would be welcome. Please let us know what the status is.

Thanks a lot!

larsmans commented 9 years ago

The status is that we have a working replacement at http://github.com/semanticize/st that is only lacking:

  1. disambiguation
  2. REST API (currently being written)
  3. Python wrapper (that will be this repo, I guess)
  4. documentation

I plan to have this finished this week (and I'm working on Saturday). You're welcome to test this new version. People at UvA are already using it.

piskvorky commented 9 years ago

Thanks Lars.

@tgalery can you keep an eye on this? Once we know how to apply semanticizest at a level where it can replace semanticizer (=API for linking entities from plain text), let's evaluate.

tgalery commented 9 years ago

Will do @piskvorky !

larsmans commented 9 years ago

REST API now works, simple disambiguation in the works but not yet finished.

larsmans commented 9 years ago

@tgalery The package is now ready for beta-testing, AFAIC.

tgalery commented 9 years ago

Thanks @larsmans I will have a go when I have the time.

tgalery commented 9 years ago

Hi @larsmans @piskvorky, I finally had time to take a look at this. I have been playing with the Danish model, and it seems that the st project has pretty much the same functionality as the semanticizest project. The only difference is that instead of a single endpoint returning all candidates, one can also get the candidates matching a best path, and an exact match of the string (as documented here: https://github.com/semanticize/st/blob/68465fe840a6087698df8963af5980373c5cedb4/cmd/semanticizest/webserver.go).

Although this is functionality added on top of semanticizest, there seems to be no spotter per se (i.e. something that determines which surface forms in the whole text are worth extracting candidates from), nor any robust incorporation of context. Am I right? Are there any plans to incorporate those in the project?

larsmans commented 9 years ago

Candidate entities are determined by semanticizest itself; this is a consequence of the hash representation. There are also no context features. We don't have plans to add them, but if @dodijk agrees that we're missing them, they could be added.

The plan was to have semanticizest do basic entity linking, and do it fast, without too many dependencies in terms of training sets, and with enough useful output for downstream code to improve its results.