igematberkeley / NLPChemExtractor


Pattern Matching Snorkel Jerry #27

Open jierui-cell opened 3 years ago

jierui-cell commented 3 years ago

Current Updates:

  1. Learning the NLTK tokenizer package to convert each sentence into tokens with their positions, so that we can check whether the words between two chemicals follow certain patterns.
  2. Understanding annotated_data and working out how to group the chemicals in a sentence into pairs. E.g., if chemicals A, B, and C appear in a sentence, they should be turned into (A, B), (A, C), (B, C). Each tuple is then used as a dictionary key, with dic[chemical_TUPLE] = sentence.
  3. Writing labeling functions in Snorkel --> snorkel_labeling_functions.ipynb. Specifically, finished the functions that check whether ['from', 'to'] appears between two chemicals, and whether the two chemicals are separated by a VERB or ADVERB.
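
For reference, a minimal sketch of what the two labeling functions in item 3 might look like. This is not the notebook code. It assumes each candidate carries a POS-tagged token list (the `(word, tag)` pairs that `nltk.pos_tag` produces) plus the token positions of the two chemicals; the names `tagged`, `idx_a`, and `idx_b` are made up for illustration. It also assumes that either 'from' or 'to' is enough to fire, and that both functions vote POSITIVE, which the notes above do not specify. The functions are plain Python so they run without Snorkel; in the notebook they would presumably be wrapped with Snorkel's `@labeling_function()` decorator.

```python
from types import SimpleNamespace

# Snorkel's convention is -1 for "abstain"; the other values are our own choice.
ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1


def tokens_between(x):
    """Return the (word, tag) pairs strictly between the two chemical mentions."""
    lo, hi = sorted((x.idx_a, x.idx_b))
    return x.tagged[lo + 1:hi]


def lf_from_to(x):
    # Assumption: fire when either "from" or "to" appears between the chemicals.
    words = {word.lower() for word, _ in tokens_between(x)}
    return POSITIVE if words & {"from", "to"} else ABSTAIN


def lf_verb_or_adverb_between(x):
    # Penn Treebank tags: verb tags start with "VB", adverb tags with "RB".
    tags = [tag for _, tag in tokens_between(x)]
    return POSITIVE if any(t.startswith(("VB", "RB")) for t in tags) else ABSTAIN


# Hand-tagged toy candidates (hypothetical data, not from annotated_data).
converted = SimpleNamespace(
    tagged=[("Glucose", "NN"), ("is", "VBZ"), ("converted", "VBN"),
            ("to", "TO"), ("pyruvate", "NN")],
    idx_a=0, idx_b=4,
)
listed = SimpleNamespace(
    tagged=[("Glucose", "NN"), ("and", "CC"), ("pyruvate", "NN")],
    idx_a=0, idx_b=2,
)
```

With these toy candidates, both functions fire on `converted` and both abstain on `listed`, since only "and" sits between the two chemicals there.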

To-Dos:

  1. Try to finish all the labeling functions (LFs) before the weekend.
  2. Finish the code that turns a sentence into a dictionary mapping each (A, B) pair to the tokenized sentence.
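
A rough sketch of the pair dictionary from To-Do 2, assuming the chemicals in a sentence are already known as a list of strings. The helper name `pair_sentences` is hypothetical, and `str.split` stands in for the NLTK tokenizer so the snippet runs without extra packages.

```python
from itertools import combinations


def pair_sentences(sentence_tokens, chemicals):
    """Map every unordered pair of chemicals to the tokenized sentence."""
    # combinations() keeps input order, so (A, B) is produced but (B, A) is not.
    return {pair: sentence_tokens for pair in combinations(chemicals, 2)}


# Stand-in tokenization; the real code would use the NLTK tokenizer.
tokens = "A reacts with B to give C".split()
pairs = pair_sentences(tokens, ["A", "B", "C"])

for pair in pairs:
    print(pair, "->", pairs[pair])
```

One thing to watch: if the same pair of chemicals shows up in more than one sentence, a plain dictionary keyed only on the tuple would keep just the last sentence, so the key may need a sentence id as well.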

Issues

  1. Efficiency of my code.
  2. Need to relearn some basic Pandas operations.
  3. Don't know how to determine whether a word is a chemical (an is_chemical function that returns True or False). Will need to look at our previous code.
  4. Did not fully understand the LF_argument_order function in the Snorkel paper. The description says, "If the candidate product is before the candidate substrate, we label FALSE." Not sure how to determine which chemical is the product and which is the substrate.
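
One possible reading of issue 4, written as a sketch rather than a confirmed answer: the labeling function may not need to work out the roles itself. If each candidate is an ordered pair, so that one mention is already tagged as the candidate substrate and the other as the candidate product when candidates are generated, the function only compares their positions. The field names `substrate_idx` and `product_idx` below are hypothetical.

```python
from types import SimpleNamespace

ABSTAIN, NEGATIVE = -1, 0


def lf_argument_order(x):
    # Assumption: x already says which mention plays which role.
    # Label FALSE (NEGATIVE) when the candidate product comes first.
    return NEGATIVE if x.product_idx < x.substrate_idx else ABSTAIN


# The same two mentions, considered in both role assignments.
forward = SimpleNamespace(substrate_idx=0, product_idx=4)
reverse = SimpleNamespace(substrate_idx=4, product_idx=0)
```

Under this reading, every unordered pair (A, B) would give two candidates, one per role assignment, and this function would vote against the one whose product appears earlier in the sentence.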