nasa-petal / bio-strategy-extractor

The Unlicense

Identify and summarize biological strategies contained in research papers. #1

Open bruffridge opened 2 years ago

bruffridge commented 2 years ago

The goal is to identify and summarize biological strategies found in research papers or other datasources, then abstract them into design strategies.

A biological strategy is a characteristic, mechanism, or process that an organism or ecosystem exhibits to accomplish a particular purpose or function within a particular context or conditions.

The main elements of a biological strategy are:

- Organism
- Part of
- Function
- Mechanism
- Context

An example biological strategy:

Strategy: The harbor seal’s whiskers possess a specialized undulated surface structure that reduces vortex-induced vibrations as the whiskers move through water.
Organism: harbor seal
Part of: whiskers
Function: reduce vibrations
Mechanism: a specialized undulated surface structure
Context: moving through water

A bio-inspired design strategy is a statement that articulates the function, mechanism, and context without using biological terms. Instead, biological terms are replaced with discipline-neutral synonyms (e.g. replacing “fur” with “fibers,” or “skin” with “membrane”).

An example design strategy:

Strategy: While moving through a liquid, a small diameter fiber with an undulated surface structure reduces vortex-induced vibrations.

Inputs: raw text (such as the title and abstract of a biology journal article) #13
Outputs: Biological strategy, design strategy, Organism, Part of, Function, Mechanism, Context.
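As a sketch of what these outputs could look like as a data structure, here is a minimal container. The field names are illustrative, not a fixed schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class BiologicalStrategy:
    """Container for the extraction outputs listed above (illustrative names)."""
    strategy: str                # full biological-strategy sentence
    organism: str                # e.g. "harbor seal"
    part_of: Optional[str]       # e.g. "whiskers"
    function: str                # e.g. "reduce vibrations"
    mechanism: str               # e.g. "a specialized undulated surface structure"
    context: str                 # e.g. "moving through water"
    design_strategy: Optional[str] = None  # discipline-neutral restatement

# The harbor seal example from above, expressed in this structure:
seal = BiologicalStrategy(
    strategy=("The harbor seal's whiskers possess a specialized undulated "
              "surface structure that reduces vortex-induced vibrations as "
              "the whiskers move through water."),
    organism="harbor seal",
    part_of="whiskers",
    function="reduce vibrations",
    mechanism="a specialized undulated surface structure",
    context="moving through water",
)
```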

Early evaluations of large language models such as GPT-3 Davinci have shown promise for generating these outputs given a proper prompt and a few training examples (1-3).
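A few-shot prompt for a completion-style model could be assembled along these lines. The template and field labels below are assumptions for illustration, not the prompt actually used in the evaluation:

```python
# One labeled example (the harbor seal strategy from above) used as a
# few-shot demonstration; a real prompt would include 1-3 such examples.
EXAMPLES = [
    {
        "text": ("The harbor seal's whiskers possess a specialized undulated "
                 "surface structure that reduces vortex-induced vibrations as "
                 "the whiskers move through water."),
        "organism": "harbor seal",
        "function": "reduce vibrations",
        "mechanism": "a specialized undulated surface structure",
        "context": "moving through water",
    },
]

def build_prompt(examples, new_text):
    """Assemble a few-shot prompt: labeled examples, then the new input."""
    parts = []
    for ex in examples:
        parts.append(
            f"Text: {ex['text']}\n"
            f"Organism: {ex['organism']}\n"
            f"Function: {ex['function']}\n"
            f"Mechanism: {ex['mechanism']}\n"
            f"Context: {ex['context']}\n"
        )
    # End with the new text and the first field label, so the model
    # continues by filling in the remaining fields.
    parts.append(f"Text: {new_text}\nOrganism:")
    return "\n".join(parts)

prompt = build_prompt(EXAMPLES, "Reindeer regulate brain temperature by panting.")
```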

AskNature has manually curated a list of biological strategies on its website, based on research papers. These may be helpful in training a machine learning model. #12

Open source large language models

Commercial large language models

Alternatives to large language models for text summarization (may not work as well):

- https://aws.amazon.com/blogs/machine-learning/part-1-set-up-a-text-summarization-project-with-hugging-face-transformers/
- https://www.projectpro.io/article/transformers-bart-model-explained/553#mcetoc_1fq07mh0qa
- https://paperswithcode.com/sota/text-summarization-on-gigaword

bruffridge commented 2 years ago

https://huggingface.co/models?pipeline_tag=summarization&sort=downloads

bruffridge commented 2 years ago

Perhaps another way to think about the problem is named entity recognition; part-of-speech tagging may also be helpful. "a specialized undulated surface structure" - form, "reduce vibrations" - function, "moving through water" - context.

bruffridge commented 2 years ago

Some other resources that may be helpful, courtesy of Herb:

- https://github.com/marshmellow77/text-summarisation-project
- https://aws.amazon.com/marketplace/pp/prodview-uzkcdmjuagetk
- https://aws.amazon.com/comprehend/
- https://huggingface.co/spaces/ml6team/keyphrase-extraction

bruffridge commented 2 years ago

Since multiple people will be working on this issue, it may be helpful to create different branches in the bio-strategy-extractor repository to track code and results for different evaluated methods. Also, please coordinate efforts so different approaches to solving the problem can be explored and results compared.

bruffridge commented 2 years ago

A colleague just informed me of a paper entitled, "Categorizing biological information based on function–morphology for bioinspired conceptual design". I uploaded it to the Literature folder in Box. Please review for potential application to this problem.

bruffridge commented 2 years ago

A Question Answering model may be another approach worth looking at:

- https://huggingface.co/models?pipeline_tag=question-answering&sort=downloads
- https://paperswithcode.com/task/question-answering
- https://github.com/sebastianruder/NLP-progress/blob/master/english/question_answering.md

For example, check out the results from this QA model when asked "What is the primary function?" and "What reduces vibrations?"

rishub-tamirisa commented 2 years ago

This paper, Rhetorical Sentence Categorization for Scientific Paper using Word2Vec Semantic Representation, seems interesting. To enable a model to actually locate which parts of a paper/abstract describe biomimetic function, we could hand-annotate a few known sentences and measure Word2Vec cosine similarity between the labeled sentences and unlabeled sentences (comparing the averages of the word vectors in each sentence).

I'm not sure if this would work well, but the reason something like this might be useful is that summarizing an abstract in general may not always result in automatically identifying the specific biomimetic function(s) described.
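The proposed similarity check can be illustrated with a toy example: average the word vectors in each sentence, then compare sentences by cosine similarity. The tiny hand-made 3-dimensional vectors below stand in for real Word2Vec embeddings:

```python
import math

# Stand-in embeddings; a real run would load trained Word2Vec vectors.
WORD_VECS = {
    "surface":    [0.9, 0.1, 0.0],
    "structure":  [0.8, 0.2, 0.1],
    "reduces":    [0.1, 0.9, 0.2],
    "vibrations": [0.0, 0.8, 0.3],
    "seals":      [0.3, 0.3, 0.9],
    "swim":       [0.2, 0.2, 0.8],
}

def sentence_vector(tokens):
    """Average the vectors of the tokens we have embeddings for."""
    vecs = [WORD_VECS[t] for t in tokens if t in WORD_VECS]
    n = len(vecs)
    return [sum(v[i] for v in vecs) / n for i in range(3)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

labeled = sentence_vector(["surface", "structure", "reduces", "vibrations"])
candidate = sentence_vector(["structure", "reduces", "vibrations"])
unrelated = sentence_vector(["seals", "swim"])

sim_related = cosine(labeled, candidate)    # high: shared vocabulary
sim_unrelated = cosine(labeled, unrelated)  # lower: different topic
```

Unlabeled sentences scoring above some threshold against the hand-annotated set would be flagged as candidate function descriptions.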

> Perhaps another way to think about the problem is named entity recognition, part-of-speech tagging may also be helpful. "a specialized undulated surface structure" - form, "reduce vibrations" - function, "moving through water" - context.

I agree that this could work well, probably by fine-tuning an existing NER model, but the challenge would be to create the training data.

hschilling commented 2 years ago

Rishub, for getting the training data, we do have access to Amazon Ground Truth, a labeling service, if that helps

bruffridge commented 2 years ago

Semantic Role Labelling might be useful for extracting Who (form), What (function), Where (context):

- https://paperswithcode.com/task/semantic-role-labeling
- https://nlpprogress.com/english/semantic_role_labeling.html
- https://web.stanford.edu/~jurafsky/slp3/19.pdf

abalai-ash commented 2 years ago

This survey article on summarization techniques may be of use:

A couple journals worth looking into:

More fine-tuning journals:

This might help with what I need to do. Not sure if this will be useful to anybody else:

rishub-tamirisa commented 2 years ago

https://arxiv.org/pdf/2106.01592v1.pdf : Biomimicry AI overview

rishub-tamirisa commented 2 years ago

https://asmedigitalcollection.asme.org/mechanicaldesign/article-abstract/136/8/081008/454553/Retrieving-Causally-Related-Functions-From-Natural (pdf): Biomimicry function identification.

abalai-ash commented 2 years ago

Steps for NLP Pipeline that we can implement in our algorithm after further literature research:

  1. Sentence segmentation: breaks the given paragraph into separate sentences.
  2. Word tokenization: extract the words from each sentence one by one.
  3. 'Parts of Speech' Prediction: identifying parts of speech.
  4. Text Lemmatization: figure out the most basic form (lemma) of each word in a sentence, so that "germ" and "germs" are treated as the same word rather than two unrelated ones.
  5. 'Stop Words' Identification: English has a lot of filler words that appear very frequently and introduce a lot of noise.
  6. Dependency Parsing: uses the grammatical laws to figure out how the words relate to one another.
  7. Entity Analysis: go through the text and identify all of the important words or “entities” in the text.
  8. Pronouns Parsing: keeps track of the pronouns with respect to the context of the sentence.
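The first few steps above can be sketched with the standard library alone; a real pipeline would use a library like spaCy or NLTK, which also covers POS tagging, lemmatization, dependency parsing, and entity analysis. The stop-word list here is a tiny sample:

```python
import re

STOP_WORDS = {"the", "a", "an", "of", "to", "that", "as", "through", "they"}

def segment_sentences(text):
    """Step 1: naive sentence segmentation on terminal punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenize(sentence):
    """Step 2: pull out word tokens, lowercased."""
    return re.findall(r"[a-z]+", sentence.lower())

def remove_stop_words(tokens):
    """Step 5: drop high-frequency filler words that mostly add noise."""
    return [t for t in tokens if t not in STOP_WORDS]

text = ("The harbor seal's whiskers reduce vibrations. "
        "They move through water.")
sentences = segment_sentences(text)
tokens = [remove_stop_words(tokenize(s)) for s in sentences]
```
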
bruffridge commented 2 years ago

@rishub-tamirisa Good find on the biomimicry function identification paper. Here's a list of subsequent papers that cited this one, many of which appear to be relevant. https://www.lens.org/lens/scholar/article/088-258-820-290-519/citations/citing

Here's one in particular that looks interesting: http://ceur-ws.org/Vol-2831/paper4.pdf

rishub-tamirisa commented 2 years ago

Thanks. That paper you linked does look interesting. "The preliminary results indicate that the ability to add ontologies to IBID allows it to extract meaning from new documents." I'm definitely going to take a look at the rest of it.

bruffridge commented 2 years ago

Screenshot (2022-06-21) from Nagel: https://www.mdpi.com/2411-9660/2/4/47/htm

rishub-tamirisa commented 2 years ago

https://arxiv.org/pdf/1909.07755.pdf : SpERT Paper

bruffridge commented 2 years ago

It may be easier to focus initially on one function, then expand the pipeline to include other functions. For example, take the function "modify/convert thermal energy". The problem then becomes identifying sections of text that describe managing thermal energy. Next comes identifying the "how": what does the text describe as the mechanism responsible for the management of thermal energy? Eventually we may want to classify these various "hows" or "strategies" into different categories (form, material, structure, process, or system).
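As a crude starting point, the narrowing to one function could be a simple keyword filter. The term list below is a made-up sample, and a trained classifier would eventually replace this:

```python
import re

# Hand-picked sample vocabulary for "modify/convert thermal energy";
# these terms are illustrative, not a vetted ontology.
THERMAL_TERMS = {"thermal", "heat", "temperature", "cooling", "warming",
                 "rewarming", "torpor", "insulating", "panting"}

def mentions_thermal_energy(sentence):
    """Flag sentences whose vocabulary overlaps the thermal keyword list."""
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    return bool(words & THERMAL_TERMS)

candidates = [
    "Reindeer regulate brain temperature under heavy heat loads.",
    "We used temperature-sensitive radio-transmitters.",
    "The whiskers reduce vortex-induced vibrations.",
]
hits = [s for s in candidates if mentions_thermal_energy(s)]
```

Sentences that pass the filter would then move on to the harder step of extracting the mechanism ("how").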

abalai-ash commented 2 years ago

DANE paper: https://hal.inria.fr/hal-02279772/file/474537_1_En_1_Chapter.pdf

abalai-ash commented 2 years ago

I will look at more methods for unsupervised learning for entity recognition. I will also look into DANE tutorials, though searching for them mostly turns up DaNLP or DaNE, which isn't what the paper was discussing.

rishub-tamirisa commented 2 years ago

Just uploaded a notebook that shows some results of SciBERT-FOBIE, see ( #3 )

rblumin24 commented 2 years ago

- https://academic.oup.com/nar/article/43/W1/W535/2467892?login=true
- https://academic.oup.com/nar/article/36/suppl_2/W399/2506595?login=false
- https://academic.oup.com/bioinformatics/article/27/19/2721/231031

PolySearch is a classification method that BioNER used as a reference, so I looked into it and it seems pretty helpful. I'm still looking for code, though.

OrganismTagger is another classifier used by the creators of BioNER, and it categorizes biomedical words, which I thought could be helpful. Again, I have only been able to find articles, not code at the moment.

rishub-tamirisa commented 2 years ago

- https://arxiv.org/pdf/2104.01364.pdf (SciBERT-CRF paper)
- https://github.com/akashgnr31/Counts-And-Measurement (repo)

bruffridge commented 2 years ago

Georgia Institute of Technology has been researching using NLP to build Structure-Behavior-Function models from text.

IBID: https://dilab.gatech.edu/ibid/ (ongoing)
DANE: http://dilab.cc.gatech.edu/dane/ (past)

rishub-tamirisa commented 2 years ago

I just re-trained SciBERT-FOBIE on a cleaned version of the dataset. ( #6 )

However, token imbalance is still a big issue.
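One common mitigation for token/label imbalance is inverse-frequency class weights fed to the loss function. The toy tag counts below are invented for illustration, not the actual FOBIE label statistics:

```python
from collections import Counter

# Made-up tag sequence: overwhelmingly "O" (outside any span), as is
# typical for span-extraction datasets.
tags = ["O"] * 90 + ["TRIGGER"] * 6 + ["ARGUMENT"] * 4

counts = Counter(tags)
total = sum(counts.values())
n_classes = len(counts)

# Weight each class inversely to its frequency, normalized so that a
# perfectly balanced dataset would give every class weight 1.0.
weights = {tag: total / (n_classes * c) for tag, c in counts.items()}
```

These weights could then be passed to a weighted cross-entropy loss so that rare span tags contribute more to the gradient than the dominant "O" tag.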

rblumin24 commented 2 years ago

Here's some more information on OrganismTagger: https://www.semanticsoftware.info/system/files/orgtagger-1.3a.pdf
And here's the BioNER repo: https://github.com/phil1995/BioNER

rishub-tamirisa commented 2 years ago

https://staff.science.uva.nl/c.monz/ltl/publications/mtsummit2017.pdf (Fine-tuning for translation models)

rishub-tamirisa commented 2 years ago

@bruffridge I originally thought that HuggingFace allowed you to train a summarization model on top of any existing language model, but I realized that since models like SciBERT and BERT are encoder-only, you still need to train the summarization decoder from scratch. There are some existing proposed methods for using pretrained encoders for summarization, but none are implemented in an existing API like HuggingFace, so I would need to fork from one of these papers or implement it on my own. I'm still reading through COVIDSum: A linguistically enriched SciBERT-based summarization model for COVID-19 scientific papers and Text Summarization with Pretrained Encoders to get a better idea of how exactly it works. People may also have uploaded some existing academic-paper summarizers on HuggingFace, so I will check that as well.

In the meantime, since using SciBERT directly for summarization is non-trivial, I could still test the AskNature data on a more general English-language summarization model, like BertSum or T5. The results likely won't be as good because of the number of low-frequency / rare terms present, but we can still see how well it works.

rishub-tamirisa commented 2 years ago

@bruffridge Google Colab works much better, thanks for the suggestion.

Model is currently training: Colab link

@hschilling I might not need to use AWS for this task after all! Although I'm still interested in getting it set up to learn more about how it works, if that's possible. After this model trains, I'd like to either try and implement the paper in the above comment for using SciBERT for text summarization, or fine-tune SciBERT as a language model directly on our existing corpora of biomimicry papers in Box. These might be more compute-intensive, in which case AWS resources might work better than Colab, which has 16 GB memory in the free version.

hschilling commented 2 years ago

@rishub-tamirisa ok, I did already request access to AWS for you

bruffridge commented 2 years ago

A few resources that may be helpful:

bruffridge commented 2 years ago

A few potentially useful datasets for Scientific NER:

bruffridge commented 2 years ago

This abstract describes two separate sets of species (actor), function (what), mechanism (how), and context, which I've labelled below. We may need to build an annotated dataset to train and evaluate models on this task. With coreference resolution it may be possible to link the actor of the first triplet, "birds", as referring to "Poorwills".

Compared to mammals, there are relatively few studies examining heterothermy in birds. In 13 bird families known to contain heterothermic species, the common poorwill (Phalaenoptilus nuttallii) is the only species that ostensibly hibernates. We used temperature-sensitive radio-transmitters to collect roost and skin temperature (Tskin) data, and winter roost preferences for free-ranging poorwills in southern Arizona. Further, to determine the effect of passive rewarming on torpor bout duration and {active rewarming}[what-1] (i.e., {the use of metabolic heat to increase Tskin}[how-1]), we experimentally shaded seven {birds}[actor-1] {during winter}[context-1] to prevent them from passively rewarming via solar radiation. {Poorwills}[actor-2] {selected winter roosts that were open to the south or southwest}[how-2], facilitating {passive solar warming}[what-2] {in the late afternoon}[context-2]. Shaded birds actively rewarmed following at least 3 days of continuous torpor. Average torpor bout duration by shaded birds was 122 h and ranged from 91 to 164 h. Active rewarming by shaded birds occurred on significantly warmer days than those when poorwills remained torpid. One shaded bird remained inactive for 45 days, during which it spontaneously rewarmed actively on eight occasions. Our findings show that during winter poorwills exhibit physiological patterns and active rewarming similar to hibernating mammals.
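A small parser for this inline `{span}[label]` annotation scheme might look like the following sketch. It assumes non-nested spans, so annotations where one span contains another would need extra handling:

```python
import re

# A labelled span is written as {span text}[label], e.g. {birds}[actor-1].
# The regex forbids braces inside the span, which is what rules out nesting.
PATTERN = re.compile(r"\{([^{}]+)\}\[([a-z]+-?\d*)\]")

def extract_spans(annotated_text):
    """Return (label, span_text) pairs in order of appearance."""
    return [(label, span) for span, label in PATTERN.findall(annotated_text)]

sample = ("{Poorwills}[actor-2] {selected winter roosts that were open to the "
          "south or southwest}[how-2], facilitating {passive solar warming}"
          "[what-2] {in the late afternoon}[context-2].")
spans = extract_spans(sample)
```

Stripping the braces and brackets back out of the text would recover the plain abstract, giving character offsets usable as NER-style training data.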

bruffridge commented 2 years ago

Another example. This one includes a sub-span for context within 'what'. Eventually we may want to break down these spans further into sub-spans. For example, one 'what' into two: 'regulate body temperature', and 'regulate brain temperature'. One 'how' into three: 'panting through the nose', 'panting through the mouth', and 'selective brain cooling'.

Reindeer (Rangifer tarandus) are protected against the Arctic winter cold by thick fur of prime insulating capacity and hence have few avenues of heat loss during work. We have investigated how these animals regulate brain temperature under heavy heat loads. Animals were instrumented for measurements of blood flow, tissue temperatures and respiratory frequency (f) under full anaesthesia, whereas measurements were also made in fully conscious animals while in a climatic chamber or running on a treadmill. At rest, brain temperature (Tbrain) rose from 38.5±0.1°C at 10°C to 39.5±0.2°C at 50°C, while f increased from ×7 to ×250 breaths min–1, with a change to open-mouth panting (OMP) at Tbrain 39.0±0.1°C, and carotid and sublingual arterial flows increased by 160% and 500%, respectively. OMP caused jugular venous and carotid arterial temperatures to drop, presumably owing to a much increased respiratory evaporative heat loss. Angular oculi vein (AOV) flow was negligible until Tbrain reached 38.9±0.1°C, but it increased to 0.81 ml min–1 kg–1 at Tbrain 39.2±0.2°C. Bilateral occlusion of both AOVs induced OMP and a rise in Tbrain and f at Tbrain >38.8°C. We propose that {reindeer}[actor] {regulate body and, particularly, brain temperature {under heavy heat loads}[context]}[what] by {a combination of panting, at first through the nose, but later, when the heat load and the minute volume requirements increase due to exercise, primarily through the mouth and that they eventually resort to selective brain cooling}[how].