Can't find T790M mutation in civicmine

hongiiv commented 1 year ago

Hi jakelever,

Thanks for this wonderful project.

When I used civicmine (http://bionlp.bcgsc.ca/civicmine), I couldn't find "T790M" in any sentence. That seemed odd to me, because EGFR T790M is a very well-known biomarker in cancer treatment.

This is a tokenizer problem: the spaCy language model (en_core_web_sm) splits "T790M" into "T790" and "M": (('T790', 'NOUN'), ('M', 'PROPN')).

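I changed the kindred package like this (kindred/Parser.py):

from spacy.symbols import ORTH  # needed for the special case below

if model not in Parser._models:
    Parser._models[model] = spacy.load(model, disable=['ner'])

self.nlp = Parser._models[model]
# Keep "T790M" as one token instead of letting the tokenizer split off the "M"
special_case = [{ORTH: "T790M"}]
self.nlp.tokenizer.add_special_case("T790M", special_case)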

With this change, "T790M" is kept as a single token: ('T790M', 'VERB').
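
The same split probably affects other mutations written in the same letter-number-letter style, so hard-coding a single string may not be enough. Below is a rough sketch of one way to generalize the workaround, not something taken from kindred itself. The names `find_substitutions` and `protect_substitutions` are hypothetical helpers, the regular expression is an assumption about what a substitution looks like (one amino acid letter, digits, one amino acid letter), and only `Tokenizer.add_special_case` and `ORTH` are real spaCy names.

```python
import re

# Single-letter amino acid codes; the pattern matches strings such as T790M or V600E.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
SUBSTITUTION = re.compile(rf"\b[{AMINO_ACIDS}]\d+[{AMINO_ACIDS}]\b")


def find_substitutions(text):
    """Return substitution-like strings in order of first appearance, without duplicates."""
    seen = []
    for match in SUBSTITUTION.finditer(text):
        if match.group() not in seen:
            seen.append(match.group())
    return seen


def protect_substitutions(nlp, texts):
    """Register every substitution found in `texts` as a single-token special case."""
    # Imported lazily so the regex helper above can be used without spaCy installed.
    from spacy.symbols import ORTH

    for text in texts:
        for mutation in find_substitutions(text):
            nlp.tokenizer.add_special_case(mutation, [{ORTH: mutation}])
```

For example, `find_substitutions("EGFR T790M and L858R")` returns `['T790M', 'L858R']`. Special cases only protect strings that have been registered, so `protect_substitutions` would have to see the text (or a list of known mutations) before it is parsed.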

best, jakelever

jakelever commented 1 year ago

Hi @hongiiv, thanks for looking into this. I'll have a little dig myself and see what other issues there may be.

stale[bot] commented 1 year ago

Is this still relevant? If so, what is blocking it? Is there anything you can do to help move it forward?

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

jakelever / civicmine

Can't find T790M mutation in civicmine #6