Closed danich1 closed 2 years ago
✔️ Deploy Preview for word-lapse ready!
🔨 Explore the source changes: 3eaac2492c533e3df599e6f0145f85143c2deb09
🔍 Inspect the deploy log: https://app.netlify.com/sites/word-lapse/deploys/62294df4b471c900089a8b1f
😎 Browse the preview: https://deploy-preview-31--word-lapse.netlify.app/
I wonder if it would be good to consult with vince re: how he might want to deal with tagged vs standard tokens on the front end and design back from that? Thinking out loud...
Sure. Tagging @vincerubinetti. Curious to know your thoughts on the point below. We have tagged and non-tagged tokens and are trying to figure out the best way to represent tagged vs. non-tagged.
So, how does this work downstream? I assume that these tokens are going to show up as "XXX_(tagged)" in the frontend, unless Vince strips out that suffix. Perhaps something a little more structured would work, like a tuple that consists of (token, is_tagged) or a dict like {'token': 'XXX', 'is_tagged': False}.
You got it! If we do the latter, then we need to let the user know that this token was tagged. This is necessary because some tokens were missed by the tagger but represent the same thing (e.g., a549 shows up, but its tagged id cvcl_0023 also shows up). I'm sure this is a simple fix for Vince if we want this on the front end.
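To make the structured option above concrete, here is a minimal sketch, not the code in this PR. It assumes the suffix really is `_(tagged)` (taken from the "XXX_(tagged)" guess above); `TAGGED_SUFFIX`, `Token`, and `to_structured` are hypothetical names used only for illustration.

```python
from typing import TypedDict

# Assumption: tagged tokens carry this suffix, as guessed in the thread.
TAGGED_SUFFIX = "_(tagged)"


class Token(TypedDict):
    """Structured form: the bare token plus an explicit boolean flag."""
    token: str
    is_tagged: bool


def to_structured(raw: str) -> Token:
    """Split a suffixed string into the bare token and an is_tagged flag."""
    if raw.endswith(TAGGED_SUFFIX):
        return {"token": raw[: -len(TAGGED_SUFFIX)], "is_tagged": True}
    return {"token": raw, "is_tagged": False}


# Example input: one tagged id and one token the tagger missed.
neighbors = [to_structured(t) for t in ["cvcl_0023_(tagged)", "a549"]]
```

With this shape the frontend never has to parse or strip a suffix; it only reads the flag.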
I'm a little out of my element here, in that I don't understand the importance or meaning behind tagged vs. not tagged.
Is this going to be a flag that exists for all words and that needs to be displayed in every visualization? Or is it only for a new visualization that I'm unaware of.
It seems like we just want to have it be a boolean... tagged vs non?
Off the cuff, I'd just say having a little icon next to tagged words? Or, if it's not important that the user be able to distinguish between tagged/non at a quick glance, we can just have that information in a tooltip, e.g. "this word was tagged".
> I'm a little out of my element here, in that I don't understand the importance or meaning behind tagged vs. not tagged.
So a natural language processing (NLP) task called named entity recognition (NER) is designed to label words that represent concepts. E.g., "dog" would be classified as canine (scientific terminology). This makes machine learning and other NLP tasks much easier to perform, as you know that tokens such as "dog" and "puppy" all represent the concept canine. This process is nice in theory, but nothing is perfect, and even NER models can miss quite a bit. So tagged vs. non-tagged in this case means seeing the word puppy as a token even though it SHOULD have been tagged as canine.
> Is this going to be a flag that exists for all words and that needs to be displayed in every visualization? Or is it only for a new visualization that I'm unaware of.
Only needs to be displayed in the word neighborhood visualization, so users can know if the tagger missed something.
> It seems like we just want to have it be a boolean... tagged vs non?
> Off the cuff, I'd just say having a little icon image next to tagged words? Or, if it's not important that the user be able to distinguish between tagged/non at a quick glance, we can just have that information in a tooltip, e.g. "this word was tagged".
Image is fine with me and I think boolean would be better than my suffix approach. @falquaddoomi if no qualms we will go with that one.
It sounds like they will be tagged more often than not, and we want the user to know when a word is not tagged? If that's the case, and considering that the neighborhood viz is already pretty crowded (especially with the color blind symbols enabled), I might opt for something more subtle like an asterisk * with a note. But I'll play around with it.
@vincerubinetti : one thing to think about is that a "tagged" entity is a well-defined concept (i.e., something that we can link to elsewhere or draw in additional metadata around if we want to). It's also something that we could autocomplete in the search pretty readily. Essentially, once something is "tagged" we know a lot more about it.
It's also worth noting that some terms are real but unlikely to be tagged because a corresponding concept probably doesn't exist. "microarray" not being tagged is not an error of the tagger; it probably just doesn't exist as a concept with metadata that can be tagged. For this reason, I think it would be better to mark the things that are tagged, as in "these are special", rather than marking the non-tagged ones as "these didn't work".
To explain my commits: I erroneously said earlier that the concept mapping dict should be loaded by the API, when in fact it's supposed to be loaded by the workers. I corrected a few other logging-related issues, added a minor concept mapping dict loading optimization, and temporarily patched the frontend so it can deal with the { token, tagged_id } objects being returned now instead of strings.
As mentioned on Thursday, this code has been adapted to incorporate a concept id mapper to report denormalized concepts back to the user instead of uninformative ids. I placed the dictionary as a global variable, but if there is a better way to do this, let me know.
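One possible alternative to a module-level global is a lazily loaded, cached mapping, so each worker process reads the file once, on first use. The sketch below is a suggestion rather than what this PR does. It assumes the mapping is stored as a JSON object of id to name; `load_concept_map` and `describe` are hypothetical names, and the `{ token, tagged_id }` shape and the `cvcl_0023`/`a549` pair are taken from this thread.

```python
import json
from functools import lru_cache
from typing import Dict, Optional, TypedDict


class Neighbor(TypedDict):
    token: str
    tagged_id: Optional[str]


@lru_cache(maxsize=1)
def load_concept_map(path: str) -> Dict[str, str]:
    """Read the id -> name mapping once; later calls reuse the cached dict.

    Assumption: the file is a JSON object such as {"cvcl_0023": "a549"}.
    """
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def describe(token: str, concept_map: Dict[str, str]) -> Neighbor:
    """Return the readable name for a concept id, or the token unchanged."""
    name = concept_map.get(token)
    if name is None:
        # Not a known concept id, so treat it as an untagged token.
        return {"token": token, "tagged_id": None}
    return {"token": name, "tagged_id": token}


# In-memory stand-in for the loaded file, for illustration only.
demo_map = {"cvcl_0023": "a549"}
tagged = describe("cvcl_0023", demo_map)
untagged = describe("microarray", demo_map)
```

Passing the mapping in as an argument keeps `describe` easy to test without touching the filesystem, while `load_concept_map` keeps the one-time loading cost inside each worker.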
I also removed the cutoff part of the word2vec models as these new models have been trained to pre-filter words, so performance should go back to expected runtime.