igematberkeley / NLPChemExtractor


string pattern layer development #9

Open mrunalimanj opened 3 years ago

mrunalimanj commented 3 years ago

@jierui-cell and @max8lee are looking into a simpler, pattern-based approach to identifying substrate/product/enzyme relationships. feel free to add updates here!

jierui-cell commented 3 years ago

Sure. I will set up a meeting with Max to go deeper into the pattern-matching side.

mrunalimanj commented 3 years ago

Goals: Get a script working that pattern-matches substrates, enzymes, products, and other reaction features, to test on a subset of training data (by Monday, 4/5)

max8lee commented 3 years ago

Currently: Both have notes and are looking into the features independently. Looking to meet during work sessions this week (Tues/Wed, Fri, Sun) to implement code.

jacob-Iuo commented 3 years ago

Onboarding myself on the labeling-function workflow, working in the fc_igemcomp/2020_nlp/snorkel/labeling-function-scratch-jacob.ipynb notebook

jacob-Iuo commented 3 years ago

Will work on creating a corpus of "candidate" sentences with reactions. This corpus will have a lot of false positives, but I want to write a script that extracts the common patterns that appear in these "candidate" sentences (i.e. substrate1-verb-preposition-filler-filler-substrate2-preposition-produce). These common patterns will be my "heuristics" for whether a reaction occurs. I will then slowly decrease the size of this corpus by being more specific about which patterns it contains, and see how this affects my heuristics.
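To make the idea concrete, here is a minimal sketch (not the repo's actual script; the toy tag table and function names are invented) of abstracting candidate sentences into coarse token patterns and counting which patterns recur:

```python
# Sketch of the pattern-extraction idea: abstract each candidate sentence into
# a coarse token pattern, then count which patterns recur. Frequent patterns
# become heuristics for "a reaction occurs here".
from collections import Counter

# Toy tag lookup standing in for a real POS/entity tagger (hypothetical).
TAGS = {
    "glucose": "CHEM", "atp": "CHEM", "pyruvate": "CHEM",
    "hexokinase": "ENZ",
    "converts": "VERB", "produces": "VERB", "phosphorylates": "VERB",
    "to": "PREP", "into": "PREP", "by": "PREP",
}

def sentence_pattern(sentence):
    """Map each token to CHEM/ENZ/VERB/PREP, or FILLER for anything else."""
    return tuple(TAGS.get(tok.lower().strip(".,"), "FILLER")
                 for tok in sentence.split())

def common_patterns(candidates, top_n=3):
    """Count patterns across candidate sentences; the most frequent are heuristics."""
    return Counter(sentence_pattern(s) for s in candidates).most_common(top_n)

candidates = [
    "Hexokinase converts glucose to pyruvate.",
    "Hexokinase phosphorylates glucose into pyruvate.",
]
print(common_patterns(candidates))
# both sentences collapse to the same ENZ-VERB-CHEM-PREP-CHEM pattern
```

Shrinking the corpus then amounts to filtering `candidates` to sentences whose pattern is in a whitelist and re-running the count.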

jacob-Iuo commented 3 years ago

Wrote some helper functions. The most notable one returns True if an enzyme is found within a user-defined window of a chemical entity in a sentence, which I plan to use as an early-stage labeling function
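A guess at the shape of that window helper (the function name and argument layout are hypothetical, not the actual code in the notebook):

```python
# Returns True if any enzyme token lies within `window` tokens of any
# chemical token in the same sentence.
def enzyme_near_chemical(tokens, enzyme_idxs, chemical_idxs, window=5):
    """tokens: tokenized sentence; *_idxs: token positions of tagged entities."""
    return any(abs(e - c) <= window
               for e in enzyme_idxs for c in chemical_idxs)

tokens = "glucose is phosphorylated by hexokinase in the liver".split()
# "glucose" at index 0, "hexokinase" at index 4
print(enzyme_near_chemical(tokens, enzyme_idxs=[4], chemical_idxs=[0], window=5))  # True
print(enzyme_near_chemical(tokens, enzyme_idxs=[4], chemical_idxs=[0], window=3))  # False
```

A predicate of this shape drops straight into a Snorkel labeling function that votes POSITIVE when it fires and ABSTAIN otherwise.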

mrunalimanj commented 3 years ago

@jierui-cell has his updates here: #27

jacob-Iuo commented 3 years ago

Created chemprot_annotations_merged.csv, which is chemprot data merged to resemble the featurized brenda data, so we can test our labeling functions on an annotated dataset. Important features are ["word_clean", "label", "entity_type", "sentence_clean"]
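Roughly, the merge looks like the sketch below (the intermediate frame layouts and column names other than the four listed above are invented for illustration; only the final column set comes from the comment):

```python
# Hypothetical sketch: join chemprot entity annotations onto sentence rows so
# the result exposes the same columns as the featurized brenda data.
import pandas as pd

entities = pd.DataFrame({
    "abstract_id": [1, 1],
    "word_clean": ["glucose", "hexokinase"],
    "entity_type": ["CHEMICAL", "GENE"],
    "label": ["SUBSTRATE-OF", "NONE"],
})
sentences = pd.DataFrame({
    "abstract_id": [1],
    "sentence_clean": ["glucose is phosphorylated by hexokinase"],
})

# Left-join each annotated word onto its source sentence, then keep only the
# brenda-style feature columns.
merged = entities.merge(sentences, on="abstract_id", how="left")
merged = merged[["word_clean", "label", "entity_type", "sentence_clean"]]
print(merged)
```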

jacob-Iuo commented 3 years ago

For the validation set we will use for our labeling functions, I am almost done structuring the chemprot data to be just like brenda! I just need to add indexes to each chemprot sentence and sort them, as well as more clearly identify substrates/products/enzymes instead of just saying they exist somewhere in a sentence. The almost-finished chemprot data is in chemprot_sentence_level_cleaned.csv; the full path is fc_igemcomp/2020_nlp/snorkel/chemprot_sentence_level_cleaned.csv

[screenshot]

mrunalimanj commented 3 years ago

After realizing chemprot might be used as validation data for developing training data, here's slightly better, summarized documentation of the modifications I made to the chemprot corpus for the NER pipeline, all found in the mar_12_NER folder!

Most processing was done in 20210402_parse_chemprot_data_for_NER_add_dividers_fix_labels.ipynb.

1. Started off with chemprot_corpus/training*.tsv

   (Need to redo processing for dev + test, because my newly made dev/test sets are also subsets of the original training data. Upside of that: we will have more training/test data!)

  1. Merged chemprot_corpus/chemprot_training_entities.tsv with chemprot_corpus/chemprot_training_relations.tsv, filtered to entries with SUBSTRATE-OF, SUBSTRATE-PRODUCT-OF, and PRODUCT-OF labels; saved to full_enzyme_chemprot_relations.tsv

  2. Reformatted the abstract data so that:

    • it looks like NER input, and
    • it can be merged with enzyme_relations.tsv, so we can transfer labels over to the abstract-level data and get word-level labels.

2. Loaded chemprot_corpus/chemprot_training_abstracts.tsv and processed it as follows:
    • removed all Greek characters from the abstract column entries; output is in the abstract_clean column
    • dropped all remaining non-ASCII characters from abstract_clean with str.encode('ascii', 'ignore').str.decode('ascii')
    • removed all abstracts that weren't represented in full_enzyme_chemprot_relations.tsv
    • tokenized sentences + words and tagged parts of speech using nltk [new columns: word_pos, word, pos]
    • used a custom parser to get the span of each word --> new columns: spans [a tuple (start, end)], with start, end being the character indices for the start/stop of the word. This drops ~6 abstracts that for some reason have a " in them and therefore can't be parsed properly.
    • using full_enzyme_chemprot_relations.tsv, added the proper labels (slightly modified to use BIO/BIOUL tagging: B prefix = beginning, I prefix = inside, O = outside) to each word where the spans overlap
    • saved the processed CSVs: all columns --> labeled_chemprot_data_all_cols.csv {oops, still tab-separated, sorry} and labeled_chemprot_data_for_NER.csv {also tab-separated}

3. Decoupled SUBSTRATE-PRODUCT-OF labels into separate SUBSTRATE-OF and PRODUCT-OF labels

4. Split up the NER files properly with dividers, in 20210326_set_up_NER_runs_with_dividers.ipynb:

  1. From labeled_chemprot_data_all_cols.csv, grouped by abstract and randomly sampled to make a 60/20/20 split for the train/dev/test files
  2. BIOUL parsing also needs spaces + 'DOCSTART' after every sentence/abstract respectively, so those were added in
  3. Output is in the data folder: ../data/ner/chemprot_sub_enzyme/clean/{train, dev, test}.txt

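The span-overlap labeling step above can be sketched as follows (my illustration, not the notebook's code; the function name and data shapes are invented):

```python
# Given entity spans from the relations file, tag each tokenized word with a
# B-/I- prefixed relation label, or O when it falls outside every entity span.
def bio_labels(words_with_spans, entity_spans):
    """words_with_spans: [(word, (start, end)), ...] in abstract order;
    entity_spans: {(start, end): "SUBSTRATE-OF", ...} from the relations file."""
    labels = []
    for word, (w_start, w_end) in words_with_spans:
        tag = "O"
        for (e_start, e_end), rel in entity_spans.items():
            if w_start >= e_start and w_end <= e_end:  # word lies inside the entity span
                tag = ("B-" if w_start == e_start else "I-") + rel
                break
        labels.append(tag)
    return labels

words = [("acetyl", (0, 6)), ("phosphate", (7, 16)), ("binds", (17, 22))]
print(bio_labels(words, {(0, 16): "SUBSTRATE-OF"}))
# ['B-SUBSTRATE-OF', 'I-SUBSTRATE-OF', 'O']
```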
jacob-Iuo commented 3 years ago

Almost done with creating the chemprot "validation" dataset. The features should match the BRENDA data now. The last task is for me to reorder the sentences within each abstract. These sentences have no abstract-level index as far as I can see, so I will have to order them based on the provided abstract text.
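One way to do that ordering (a sketch of the idea, not the actual script) is to sort each abstract's sentences by their character offset in the abstract text:

```python
# Sort sentences that lack an explicit index by where they first appear in
# their source abstract; sentences not found are kept stably at the end.
def order_sentences(abstract, sentences):
    def position(s):
        idx = abstract.find(s)
        return idx if idx != -1 else len(abstract)
    return sorted(sentences, key=position)

abstract = "Hexokinase acts first. Glucose is consumed. ATP is produced."
shuffled = ["ATP is produced.", "Hexokinase acts first.", "Glucose is consumed."]
print(order_sentences(abstract, shuffled))
# ['Hexokinase acts first.', 'Glucose is consumed.', 'ATP is produced.']
```

This relies on the cleaned sentences matching the abstract text exactly; if the cleaning steps changed any characters, a fuzzy match would be needed instead.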

jacob-Iuo commented 3 years ago

The chemprot "validation" dataset is finished. The sentences are ordered and indexed; a screenshot is attached below. Again, this dataset is located at fc_igemcomp/2020_nlp/snorkel/chemprot_sentence_level_cleaned.csv

[screenshot]