Open mrunalimanj opened 3 years ago
Sure. I will set up a meeting with Max to go deeper into the pattern-matching side.
Goals: Get a script working that pattern-matches reaction features (substrates? enzymes? products? others?) and test it on a subset of the training data (by Monday, 4/5).
Currently: Both have notes and are looking into the features independently. Looking to meet during work sessions this week (Tues/Wed, Fri, Sun) to implement code.
onboarding myself on the labeling function workflow, working in the fc_igemcomp/2020_nlp/snorkel/labeling-function-scratch-jacob.ipynb notebook
will work on creating a corpus of "candidate" sentences with reactions. This corpus will have a lot of false positives, but I want to write a script that extracts the common patterns that appear in these "candidate" sentences (i.e. substrate1-verb-preposition-filler-filler-substrate2-preposition-produce). These common patterns will be my "heuristics" for whether a reaction occurs. I will then gradually shrink this corpus by being more specific about which patterns qualify a sentence for it, and see how this affects my heuristics
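A minimal sketch of the pattern-counting idea above, assuming each candidate sentence has already been tagged with coarse roles. The role names (CHEM, VERB, PREP, OTHER), the example sentences, and the function names are all hypothetical and only there for illustration:

```python
from collections import Counter

# Hypothetical, hand-tagged candidate sentences: (token, coarse role) pairs.
# In practice the roles would come from an entity tagger plus a POS tagger.
candidates = [
    [("glucose", "CHEM"), ("is", "VERB"), ("converted", "VERB"),
     ("to", "PREP"), ("gluconate", "CHEM")],
    [("lactate", "CHEM"), ("is", "VERB"), ("oxidized", "VERB"),
     ("to", "PREP"), ("pyruvate", "CHEM")],
    [("the", "OTHER"), ("enzyme", "OTHER"), ("binds", "VERB"), ("ATP", "CHEM")],
]


def role_pattern(tagged_sentence):
    """Collapse a tagged sentence into a pattern string like 'CHEM-VERB-PREP-CHEM'."""
    roles = []
    for _, role in tagged_sentence:
        # Merge consecutive repeats so "is converted" fills a single VERB slot.
        if not roles or roles[-1] != role:
            roles.append(role)
    return "-".join(roles)


def common_patterns(sentences, min_count=2):
    """Return (pattern, count) pairs seen at least min_count times, most common first."""
    counts = Counter(role_pattern(s) for s in sentences)
    return [(p, n) for p, n in counts.most_common() if n >= min_count]


patterns = common_patterns(candidates)
print(patterns)
```

Raising `min_count`, or restricting which patterns admit a sentence into the corpus, would be one way to make the corpus more specific over time.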
wrote some helper functions. The most notable one returns true if an enzyme is found within a user-defined window of a chemical entity in the same sentence, which I plan to use as an early-stage labeling function
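For reference, a rough sketch of what such a window check could look like. This is not the helper from the notebook; the function names, the "ENZYME"/"CHEMICAL" type strings, and the label constants are assumptions made for the example (the -1 abstain value follows the usual Snorkel convention):

```python
ABSTAIN = -1   # assumed label values, following the common Snorkel convention
REACTION = 1


def enzyme_near_chemical(entity_types, window=5):
    """Return True if any ENZYME token is within `window` tokens of a CHEMICAL token.

    entity_types holds one entity-type string per token in the sentence.
    """
    enzyme_idx = [i for i, t in enumerate(entity_types) if t == "ENZYME"]
    chemical_idx = [i for i, t in enumerate(entity_types) if t == "CHEMICAL"]
    return any(abs(e - c) <= window for e in enzyme_idx for c in chemical_idx)


def lf_enzyme_near_chemical(entity_types, window=5):
    """Labeling-function style wrapper: vote REACTION or abstain."""
    return REACTION if enzyme_near_chemical(entity_types, window) else ABSTAIN


# Token types for a sentence such as "The hexokinase enzyme rapidly phosphorylates glucose"
types = ["O", "ENZYME", "O", "O", "O", "CHEMICAL"]
print(lf_enzyme_near_chemical(types, window=4))  # enzyme and chemical are 4 tokens apart
print(lf_enzyme_near_chemical(types, window=2))
```

In a Snorkel workflow the wrapper would typically be decorated with `@labeling_function()` and read the token types from a data point rather than taking them as an argument.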
@jierui-cell has his updates here: #27
created chemprot_annotations_merged.csv which is chemprot data merged to resemble featurized brenda data so we can test our labeling functions on an annotated dataset. Important features are ["word_clean", "label", "entity_type", "sentence_clean"]
For the validation set we will use for our labeling functions, I am almost done structuring the chemprot data to be just like brenda! I just need to add indexes to each chemprot sentence and sort them, as well as more clearly identify substrates/products/enzymes instead of just saying they exist somewhere in a sentence. The almost-finished chemprot data is in chemprot_sentence_level_cleaned.csv. Full path is fc_igemcomp/2020_nlp/snorkel/chemprot_sentence_level_cleaned.csv
After realizing chemprot might be used as validation data for developing training data, here's a slightly better/summarized documentation of the modifications I made to the chemprot corpus for the NER pipeline, all found in the mar_12_NER folder!
most processing was done in 20210402_parse_chemprot_data_for_NER_add_dividers_fix_labels.ipynb.
need to redo processing for dev + test, because my newly made dev/test are also subsets of the original training data. upside of that ^^: we will have more training/test data!
1. merged chemprot_corpus/chemprot_training_entities.tsv, chemprot_corpus/chemprot_training_relations.tsv, filtered on entries with SUBSTRATE-OF, SUBSTRATE-PRODUCT-OF, PRODUCT-OF labels. saved to full_enzyme_chemprot_relations.tsv
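A small sketch of the merge-and-filter step above, using toy in-memory frames instead of the real TSVs. The column names (pmid, entity_id, entity_type, text, relation, arg1, arg2) and the rows are assumptions for illustration, not the actual ChemProt headers or the notebook's code:

```python
import pandas as pd

# Toy stand-ins for the entities and relations tables (column names are assumed).
entities = pd.DataFrame({
    "pmid": [1, 1, 2, 2],
    "entity_id": ["T1", "T2", "T1", "T2"],
    "entity_type": ["CHEMICAL", "GENE-Y", "CHEMICAL", "GENE-N"],
    "text": ["glucose", "hexokinase", "aspirin", "COX-1"],
})
relations = pd.DataFrame({
    "pmid": [1, 2],
    "relation": ["SUBSTRATE-OF", "INHIBITOR"],
    "arg1": ["T1", "T1"],
    "arg2": ["T2", "T2"],
})

# Relation labels to keep, as listed in the step above.
ENZYME_RELATIONS = ["SUBSTRATE-OF", "SUBSTRATE-PRODUCT-OF", "PRODUCT-OF"]


def attach_entity(rel, ents, arg_col):
    """Join entity details onto the relation rows for one argument column."""
    renamed = ents.rename(
        columns={c: f"{c}_{arg_col}" for c in ents.columns if c != "pmid"}
    )
    return rel.merge(
        renamed,
        left_on=["pmid", arg_col],
        right_on=["pmid", f"entity_id_{arg_col}"],
        how="inner",
    )


filtered = relations[relations["relation"].isin(ENZYME_RELATIONS)]
merged = attach_entity(attach_entity(filtered, entities, "arg1"), entities, "arg2")
print(merged[["pmid", "relation", "text_arg1", "text_arg2"]])

# Writing the result as a tab-separated file would then be, for example:
# merged.to_csv("full_enzyme_chemprot_relations.tsv", sep="\t", index=False)
```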
2. now reformatting abstract data in a manner such that:
- it looks like NER input and
- it can be merged with enzyme_relations.tsv so we can transfer labels over to abstract-level data to get word-level labels.
- Loaded chemprot_corpus/chemprot_training_abstracts.tsv, and processed as follows:
  - removed all greek characters from abstract column entries, output of that in abstract_clean column
  - ignored all non-ascii characters to clean abstract_clean column, with str.encode('ascii', 'ignore').str.decode('ascii')
  - removed all abstracts that weren't represented in full_enzyme_chemprot_relations.tsv.
  - tokenized sentences + words and their part of speech using nltk [new columns: word_pos, word, pos]
  - using custom parser to get spans of each word --> new columns: spans [a tuple: (start, end)], start, end representing the relevant character indices for start/stop of the word. this drops out ~6 abstracts that for some reason have a " in them and therefore can't be parsed properly.
  - using full_enzyme_chemprot_relations.tsv, adding the proper labels (slightly modified to use BIO/BIOUL tagging: B prefix = beginning, I prefix = inside, O = outside) to each word where the spans overlap.
  - saved, processed CSVs: all columns --> labeled_chemprot_data_all_cols.csv {oops, still tab-separated, sorry} and labeled_chemprot_data_for_NER.csv {also tab-separated}
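To illustrate the span and labeling steps above, here is a simplified sketch. It swaps in a whitespace regex for the nltk tokenizer and the custom span parser, and the function names, example text, and entity spans are made up for the example:

```python
import re


def to_ascii(text):
    """Drop non-ASCII characters, like the str.encode/str.decode step, for one string."""
    return text.encode("ascii", "ignore").decode("ascii")


def word_spans(text):
    """Return (word, start, end) triples with character offsets; end is exclusive."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def bio_labels(text, entity_spans):
    """Tag each word B-/I-<label> if it overlaps an entity span, otherwise O.

    entity_spans is a list of (start, end, label) in the coordinates of `text`,
    so offsets must refer to the cleaned text, not the raw abstract.
    """
    tagged = []
    for word, start, end in word_spans(text):
        tag = "O"
        for e_start, e_end, label in entity_spans:
            if start < e_end and end > e_start:  # character ranges overlap
                prefix = "B" if start <= e_start else "I"
                tag = f"{prefix}-{label}"
                break
        tagged.append((word, tag))
    return tagged


text = "ATP citrate lyase cleaves citrate"
spans = [(0, 17, "ENZYME"), (26, 33, "SUBSTRATE-OF")]  # hypothetical annotations
for word, tag in bio_labels(text, spans):
    print(word, tag)
```

One thing this makes visible: removing characters during cleaning shifts every later offset, so annotation spans taken from the raw abstract would need to be remapped before the overlap check.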
- SUBSTRATE-PRODUCT-OF labels into just SUBSTRATE-OF, PRODUCT-OF labels
- 20210326_fix_NER_labels.ipynb: labeled_chemprot_data_all_cols.csv, labeled_chemprot_data_for_NER.csv
- 20210326_set_up_NER_runs_with_dividers.ipynb: labeled_chemprot_data_all_cols.csv, grouped by abstract, randomly sampled to make 60/20/20 split for train/dev/test files: ../data/ner/chemprot_sub_enzyme/clean/{train, dev, test}.txt

Almost done with creating the chemprot "validation" dataset. Features should match BRENDA data now. The last task is for me to reorder the sentences within an abstract. These sentences have no abstract-level index as far as I can see so I will have to order them based on the provided abstract.
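For the 60/20/20 train/dev/test split grouped by abstract mentioned earlier, a minimal sketch of one way to do it follows. The function name, the seed, and the ID values are hypothetical. Splitting on abstract IDs rather than on rows keeps every sentence of an abstract in a single file, which avoids dev/test overlapping with train:

```python
import random


def split_by_abstract(abstract_ids, fractions=(0.6, 0.2, 0.2), seed=0):
    """Randomly assign whole abstracts to train/dev/test by the given fractions."""
    ids = sorted(set(abstract_ids))      # one entry per abstract, stable order
    random.Random(seed).shuffle(ids)     # seeded so the split is reproducible
    n_train = int(len(ids) * fractions[0])
    n_dev = int(len(ids) * fractions[1])
    return {
        "train": set(ids[:n_train]),
        "dev": set(ids[n_train:n_train + n_dev]),
        "test": set(ids[n_train + n_dev:]),  # remainder goes to test
    }


# One ID per word-level row; IDs repeat because each abstract has many rows.
row_abstract_ids = [pmid for pmid in range(100, 110) for _ in range(3)]
split = split_by_abstract(row_abstract_ids)
print({name: len(ids) for name, ids in split.items()})
```

Rows can then be routed to a file by checking which set their abstract ID falls in.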
chemprot "validation" dataset is finished. The sentences are ordered and indexed; a screenshot is attached below. Again, this dataset is located at fc_igemcomp/2020_nlp/snorkel/chemprot_sentence_level_cleaned.csv
@jierui-cell and @max8lee are looking into a simpler, pattern-based approach to identifying substrate/product/enzyme relationships. feel free to add updates here!