Learning the NLTK tokenizer package to convert each sentence into (position, word) tokens, in order to check whether the words between two chemicals follow certain patterns.
Understanding annotated_data and working out a way to group the chemicals in a sentence into pairs. E.g., if chemicals A, B, and C are in a sentence, they should be turned into (A, B), (A, C), (B, C). Each tuple should then be used as a key in a dictionary, with dic[chemical_TUPLE] = sentence.
Writing labeling functions in Snorkel --> snorkel_labeling_functions.ipynb. Specifically, finished the functions that check whether ['from', 'to'] appears between two chemicals, and whether the two chemicals are separated by a VERB or ADVERB.
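A rough sketch of the tokenizing step and the from/to check described above. It is plain Python and does not call the Snorkel or NLTK APIs: the regex is the same pattern that NLTK's WordPunctTokenizer uses, and the names tokenize_with_positions, words_between, and lf_from_to, as well as the label values, are placeholders rather than the notebook's actual code. It also assumes that either 'from' or 'to' is enough to fire, and that each chemical is a single token.

```python
import re

# Label values are an assumption; match them to the Snorkel version in use.
TRUE, ABSTAIN = 1, 0

# Same pattern as NLTK's WordPunctTokenizer: runs of word characters or punctuation.
TOKEN_RE = re.compile(r"\w+|[^\w\s]+")


def tokenize_with_positions(sentence):
    """Return a list of (position, word) tuples for the sentence."""
    return list(enumerate(TOKEN_RE.findall(sentence)))


def words_between(tokens, chem_a, chem_b):
    """Return the lowercased words strictly between the two chemicals.

    Uses the first occurrence of each chemical and raises ValueError
    if either one is not a token of the sentence.
    """
    words = [word for _, word in tokens]
    start, end = sorted((words.index(chem_a), words.index(chem_b)))
    return [word.lower() for word in words[start + 1:end]]


def lf_from_to(sentence, chem_a, chem_b):
    """Label TRUE if 'from' or 'to' sits between the two chemicals, else abstain."""
    between = words_between(tokenize_with_positions(sentence), chem_a, chem_b)
    return TRUE if ("from" in between or "to" in between) else ABSTAIN


sentence = "Ethanol is converted to acetaldehyde by the enzyme."
print(tokenize_with_positions(sentence)[:4])
print(lf_from_to(sentence, "Ethanol", "acetaldehyde"))
```

The VERB/ADVERB check could reuse words_between and run a part-of-speech tagger over the returned words instead of the membership test.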
To-Dos:
Try to finish all the labeling functions (LFs) before the weekend.
Finish the code that turns a sentence into a dictionary mapping each (A, B) pair to the tokenized sentence.
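One possible shape for the pair dictionary, as a sketch only. It assumes the list of chemicals in the sentence is already known (for example from annotated_data), and pair_dictionary is a made-up name. The regex tokenizer is a stand-in that uses the same pattern as NLTK's WordPunctTokenizer.

```python
import re
from itertools import combinations

# Stand-in tokenizer; same pattern as NLTK's WordPunctTokenizer.
TOKEN_RE = re.compile(r"\w+|[^\w\s]+")


def pair_dictionary(sentence, chemicals):
    """Map every unordered pair of chemicals to the tokenized sentence.

    `chemicals` is assumed to be the list of chemical names found in
    `sentence`, in order of appearance.
    """
    tokens = list(enumerate(TOKEN_RE.findall(sentence)))
    # combinations() yields (A, B), (A, C), (B, C) for [A, B, C].
    return {pair: tokens for pair in combinations(chemicals, 2)}


sentence = "Glucose, pyruvate and lactate were measured."
pairs = pair_dictionary(sentence, ["Glucose", "pyruvate", "lactate"])
for pair in pairs:
    print(pair)
```

Keying on the pair alone means the same two chemicals in a second sentence would overwrite the first entry, so a sentence id may need to be part of the key once this runs over the whole dataset.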
Issues:
Efficiency of my code.
Recalling some basic pandas operations.
Don't know how to determine whether a word is a chemical (the is_chemical function that returns True or False). Will need to look at our previous code.
Did not fully understand the LF_argument_order function in the Snorkel paper. The description says, "If the candidate product is before the candidate substrate, we label FALSE." Not sure how to determine which chemical is the product and which is the substrate.
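One literal reading of the quoted description, as a sketch. It assumes each candidate pair already records which mention is the product and which is the substrate, along with their token positions, so it does not answer the open question of how to determine those roles. The function name, its arguments, and the label values are placeholders, not the paper's code.

```python
# Label values are an assumption; match them to the Snorkel version in use.
FALSE, ABSTAIN = -1, 0


def lf_argument_order(product_pos, substrate_pos):
    """Label FALSE when the product comes before the substrate, else abstain.

    Both arguments are token positions within the same sentence.
    """
    return FALSE if product_pos < substrate_pos else ABSTAIN


print(lf_argument_order(product_pos=1, substrate_pos=5))
print(lf_argument_order(product_pos=5, substrate_pos=1))
```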
Current Updates:
To-Dos:
Issues: