Open bruffridge opened 2 years ago
Perhaps another way to think about the problem is named entity recognition; part-of-speech tagging may also be helpful. "a specialized undulated surface structure" - form, "reduce vibrations" - function, "moving through water" - context.
Some other resources that may be helpful courtesy of Herb: https://github.com/marshmellow77/text-summarisation-project https://aws.amazon.com/marketplace/pp/prodview-uzkcdmjuagetk https://aws.amazon.com/comprehend/ https://huggingface.co/spaces/ml6team/keyphrase-extraction
Since multiple people will be working on this issue, it may be helpful to create different branches in the bio-strategy-extractor repository to track code and results for different evaluated methods. Also, please coordinate efforts so different approaches to solving the problem can be explored and results compared.
A colleague just informed me of a paper entitled, "Categorizing biological information based on function–morphology for bioinspired conceptual design". I uploaded it to the Literature folder in Box. Please review for potential application to this problem.
A Question Answering model may be another approach worth looking at. https://huggingface.co/models?pipeline_tag=question-answering&sort=downloads https://paperswithcode.com/task/question-answering https://github.com/sebastianruder/NLP-progress/blob/master/english/question_answering.md
For example, check out the results from this QA model when asked "What is the primary function?" and "What reduces vibrations?"
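If we go the QA route, our labeled strategies could be cast as SQuAD-style training records, where each answer is a literal span of the context with a character offset. A minimal sketch (the helper name `make_qa_example` is mine, not from any library), using the harbor seal example and the questions above:

```python
context = ("The harbor seal's whiskers possess a specialized undulated surface "
           "structure that reduces vortex-induced vibrations as the whiskers "
           "move through water.")

def make_qa_example(question, answer, context):
    """Cast a labeled span as a SQuAD-style (question, context, answer) record."""
    start = context.find(answer)
    assert start != -1, "answer must be a literal span of the context"
    return {"question": question, "context": context,
            "answers": {"text": [answer], "answer_start": [start]}}

ex = make_qa_example("What reduces vibrations?",
                     "a specialized undulated surface structure", context)
# The stored offset must point back at the answer span.
assert context[ex["answers"]["answer_start"][0]:].startswith("a specialized")
```

Records in this shape could be fed to any extractive QA fine-tuning setup (e.g. the HuggingFace models linked above).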
This paper, "Rhetorical Sentence Categorization for Scientific Paper using Word2Vec Semantic Representation", seems interesting. To enable a model to actually locate where parts of a paper/abstract describe biomimetic function, we could hand-annotate a few known sentences and look for cosine similarity, using Word2Vec, between the labeled and unlabeled sentences (comparing the averages of the word vectors in each sentence).
I'm not sure if this would work well, but the reason something like this might be useful is that summarizing an abstract may not always automatically identify the specific biomimetic function(s) described.
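To make the averaged-vector idea concrete, here's a minimal sketch. The 3-d embeddings below are toy stand-ins; in practice you'd load real Word2Vec vectors (e.g. via gensim) and the labeled sentence would be one of our hand-annotated function sentences:

```python
import math

def avg_vector(tokens, embeddings):
    """Average the word vectors for tokens found in the embedding table."""
    dim = len(next(iter(embeddings.values())))
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    if not vecs:
        return [0.0] * dim
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# Toy 3-d embeddings standing in for real pretrained Word2Vec vectors.
emb = {
    "reduce": [0.9, 0.1, 0.0], "vibrations": [0.8, 0.2, 0.1],
    "dampen": [0.85, 0.15, 0.05], "oscillations": [0.75, 0.25, 0.1],
    "cell": [0.0, 0.9, 0.4],
}
labeled = avg_vector(["reduce", "vibrations"], emb)       # annotated function sentence
candidate = avg_vector(["dampen", "oscillations"], emb)   # unlabeled, similar meaning
unrelated = avg_vector(["cell"], emb)                     # unlabeled, different topic
assert cosine(labeled, candidate) > cosine(labeled, unrelated)
```

Sentences whose averaged vector sits above some similarity threshold to a labeled function sentence would be flagged as candidate function descriptions.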
Perhaps another way to think about the problem is named entity recognition; part-of-speech tagging may also be helpful. "a specialized undulated surface structure" - form, "reduce vibrations" - function, "moving through water" - context.
I agree that this could work well, probably by fine-tuning an existing NER model, but the challenge would be to create the training data.
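For the training data, one plausible shape (the FORM/FUNCTION/CONTEXT label scheme here is just my guess at how we'd annotate, not an existing standard) is character-offset spans converted to token-level BIO tags, which is what most NER fine-tuning recipes expect:

```python
text = "A specialized undulated surface structure reduces vibrations while moving through water"

def span(phrase):
    """Locate a labeled phrase in the text as (start, end) character offsets."""
    start = text.find(phrase)
    assert start != -1
    return start, start + len(phrase)

# Hypothetical annotation of the harbor-seal-style sentence above.
entities = [
    (*span("A specialized undulated surface structure"), "FORM"),
    (*span("reduces vibrations"), "FUNCTION"),
    (*span("moving through water"), "CONTEXT"),
]

def bio_tags(text, entities):
    """Convert character-offset entity spans to token-level BIO tags."""
    tags, pos = [], 0
    for tok in text.split():
        start = text.find(tok, pos)
        end = start + len(tok)
        pos = end
        tag = "O"
        for s, e, label in entities:
            if start >= s and end <= e:
                tag = ("B-" if start == s else "I-") + label
        tags.append(tag)
    return tags

tags = bio_tags(text, entities)
assert tags[:2] == ["B-FORM", "I-FORM"]
assert tags[-3:] == ["B-CONTEXT", "I-CONTEXT", "I-CONTEXT"]
```

Even a few hundred sentences in this format might be enough to fine-tune an off-the-shelf NER model, and it's the format a labeling service could export to.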
Rishub, for getting the training data, we do have access to Amazon SageMaker Ground Truth, a labeling service, if that helps.
Semantic Role Labelling might be useful for extracting Who (form), What (function), Where (context): https://paperswithcode.com/task/semantic-role-labeling https://nlpprogress.com/english/semantic_role_labeling.html https://web.stanford.edu/~jurafsky/slp3/19.pdf
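If an off-the-shelf SRL model tags our sentences with PropBank-style roles, the mapping onto our schema might look roughly like the sketch below. The `srl_frame` dict is a hypothetical model output, and the role-to-slot mapping is a rough heuristic of mine (e.g. the function is really the predicate plus its ARG1):

```python
# Hypothetical SRL output for "The undulated surface structure reduces
# vibrations in water." -- not the output of any specific model.
srl_frame = {
    "verb": "reduces",
    "ARG0": "The undulated surface structure",  # agent
    "ARG1": "vibrations",                       # patient
    "ARGM-LOC": "in water",                     # locative modifier
}

def frame_to_strategy(frame):
    """Map PropBank-style roles onto form / function / context (rough heuristic)."""
    return {
        "form": frame.get("ARG0"),
        "function": f'{frame["verb"]} {frame.get("ARG1", "")}'.strip(),
        "context": frame.get("ARGM-LOC"),
    }

strategy = frame_to_strategy(srl_frame)
assert strategy["function"] == "reduces vibrations"
```

The attraction of SRL is that we'd get the who/what/where decomposition without training our own span model; the open question is how well general-purpose SRL holds up on biology abstracts.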
This survey of summarization techniques may be of use:
A couple of papers worth looking into:
More papers on fine-tuning:
This might help with what I need to do. Not sure if this will be useful to anybody else:
https://arxiv.org/pdf/2106.01592v1.pdf : Biomimicry AI overview
https://asmedigitalcollection.asme.org/mechanicaldesign/article-abstract/136/8/081008/454553/Retrieving-Causally-Related-Functions-From-Natural (pdf): Biomimicry function identification.
Steps for an NLP pipeline that we can implement in our algorithm after further literature research:
@rishub-tamirisa Good find on the biomimicry function identification paper. Here's a list of subsequent papers that cited this one, many of which appear to be relevant. https://www.lens.org/lens/scholar/article/088-258-820-290-519/citations/citing
Here's one in particular that looks interesting: http://ceur-ws.org/Vol-2831/paper4.pdf
Thanks. That paper you linked does look interesting. "The preliminary results indicate that the ability to add ontologies to IBID allows it to extract meaning from new documents." I'm definitely going to take a look at the rest of it.
From Nagel: https://www.mdpi.com/2411-9660/2/4/47/htm
https://arxiv.org/pdf/1909.07755.pdf : SpERT Paper
It may be easier to focus initially on one function, then expand the pipeline to include other functions. For example, take the function "modify/convert thermal energy". The problem then becomes identifying sections of text that describe managing thermal energy. Next comes identifying the "how": what the text describes as the mechanism responsible for the management of thermal energy. Eventually we may want to classify these various "hows" or "strategies" into different categories (form, material, structure, process, or system).
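As a first pass at "sections of text that describe managing thermal energy", even a crude keyword filter could bootstrap candidate sentences for annotation; a real pipeline would replace this with a trained classifier, and the keyword list below is just an illustrative guess:

```python
import re

# Illustrative keyword list for thermal-energy management; far from exhaustive.
THERMAL = re.compile(
    r"\b(thermal|heat|temperature|thermoregulat\w*|torpor|rewarm\w*)\b", re.I)

def thermal_sentences(text):
    """Return sentences that mention thermal-energy-related keywords."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [s for s in sentences if THERMAL.search(s)]

abstract = ("Reindeer are protected against the Arctic winter cold by thick fur. "
            "We investigated how these animals regulate brain temperature under heavy heat loads. "
            "Animals were instrumented for measurements of blood flow.")
hits = thermal_sentences(abstract)
assert len(hits) == 1 and "temperature" in hits[0]
```

The surviving sentences would then go to the "how" extraction step.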
This goes over BioNER: https://www.frontiersin.org/articles/10.3389/fcell.2020.00673/full
This is a page linking to the GitHub for BioNER: https://github.com/JogleLew/bioner-cross-sharing
Brown Clustering algorithm with paper links in the README: https://github.com/yangyuan/brown-clustering
Another paper I glanced over: NERO, a biomedical named entity recognition ontology: https://www.nature.com/articles/s41540-021-00200-x
Different models for entity recognition: https://danlp-alexandra.readthedocs.io/en/latest/docs/tasks/ner.html
I will look at more methods for unsupervised learning for entity recognition. I will also look into DANE tutorials, although searching for them mostly turns up DaNLP or DaNE results, which isn't what the paper was discussing.
Just uploaded a notebook that shows some results of SciBERT-FOBIE, see ( #3 )
https://academic.oup.com/nar/article/43/W1/W535/2467892?login=true https://academic.oup.com/nar/article/36/suppl_2/W399/2506595?login=false https://academic.oup.com/bioinformatics/article/27/19/2721/231031
PolySearch is a classification method that BioNER used as a reference, so I looked into it and it seems pretty helpful. I'm still looking for code, though.
OrganismTagger is another classifier used by the creators of BioNER, and it categorizes biomedical words, which I thought could be helpful. Again, I have only been able to find articles, not code at the moment.
https://arxiv.org/pdf/2104.01364.pdf (SciBERT-CRF paper) https://github.com/akashgnr31/Counts-And-Measurement (repo)
Georgia Institute of Technology has been researching using NLP to build Structure-Behavior-Function models from text.
IBID: https://dilab.gatech.edu/ibid/ (ongoing) DANE: http://dilab.cc.gatech.edu/dane/ (past)
I just re-trained SciBERT-FOBIE on a cleaned version of the dataset. ( #6 )
However, token imbalance is still a big issue.
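One standard mitigation for token imbalance (the "O" tag swamping the entity tags) is weighting the loss inversely to tag frequency; I haven't applied this to the FOBIE run yet, but the weights could be computed like this and passed to something like PyTorch's `CrossEntropyLoss(weight=...)` (the tag names below are illustrative):

```python
from collections import Counter

def inverse_freq_weights(tags):
    """Weight each tag inversely proportional to its frequency in the data."""
    counts = Counter(tags)
    total = sum(counts.values())
    return {t: total / (len(counts) * c) for t, c in counts.items()}

# Illustrative distribution: "O" dominates, entity tags are rare.
tags = ["O"] * 90 + ["B-TRIGGER"] * 6 + ["B-ARG"] * 4
w = inverse_freq_weights(tags)
assert w["B-ARG"] > w["B-TRIGGER"] > w["O"]
```

Rare tags end up with proportionally larger weights, so the model is penalized more for missing them than for over-predicting "O".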
Here's some more information on OrganismTagger: https://www.semanticsoftware.info/system/files/orgtagger-1.3a.pdf And here's the BioNER repo: https://github.com/phil1995/BioNER
https://staff.science.uva.nl/c.monz/ltl/publications/mtsummit2017.pdf (Fine-tuning for translation models)
@bruffridge I originally thought that HuggingFace allowed you to train a summarization model on top of any existing language model, but I realized that since models like SciBERT and BERT are encoder-only, you still need to train the summarization decoder from scratch. There are some proposed methods for using pretrained encoders for summarization, but none are implemented in an existing API like HuggingFace; I would need to fork one of these papers' codebases or implement it on my own. I'm still reading through "COVIDSum: A linguistically enriched SciBERT-based summarization model for COVID-19 scientific papers" and "Text Summarization with Pretrained Encoders" to get a better idea of how exactly it works. People may also have uploaded some existing academic-paper summarizers to HuggingFace, so I will check that as well.
In the meantime, since using SciBERT directly for summarization is non-trivial, I could still test the AskNature data on a more general English summarization model, like BertSum or T5. The results likely won't be as good because of the number of low-frequency/rare terms present, but we can still see how well it works.
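For judging those results it might also help to have a trivial non-neural baseline to compare against. A frequency-based extractive sketch (my own toy baseline, not BertSum/T5) would look like:

```python
import re
from collections import Counter

def extractive_summary(text, n=1):
    """Pick the n sentences with the highest average word frequency --
    a crude extractive baseline, not a substitute for a neural summarizer."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(s):
        toks = re.findall(r"[a-z']+", s.lower())
        return sum(freq[t] for t in toks) / (len(toks) or 1)

    return sorted(sentences, key=score, reverse=True)[:n]

abstract = ("Seal whiskers have an undulated surface structure. "
            "The undulated structure reduces vortex-induced vibrations. "
            "Field observations were conducted in 2003.")
top = extractive_summary(abstract, n=1)
assert len(top) == 1 and top[0] in abstract
```

If T5/BertSum can't beat something this simple on the AskNature abstracts, that tells us the rare-term problem is severe.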
@bruffridge Google Colab works much better, thanks for the suggestion.
Model is currently training: Colab link
@hschilling I might not need to use AWS for this task after all! Although I'm still interested in getting it set up to learn more about how it works, if that's possible. After this model trains, I'd like to either try to implement the paper in the above comment for using SciBERT for text summarization, or fine-tune SciBERT as a language model directly on our existing corpora of biomimicry papers in Box. These might be more compute-intensive, in which case AWS resources might work better than Colab, which has 16 GB of memory in the free version.
@rishub-tamirisa ok, I did already request access to AWS for you
A few potentially useful datasets for Scientific NER:
This abstract describes two separate actor (species), function (what), mechanism (how), context sets, which I've labelled below. We may need to build an annotated dataset to train and evaluate models on this task. With coreference resolution it may be possible to link the actor of the first set, "birds", as referring to "Poorwills".
Compared to mammals, there are relatively few studies examining heterothermy in birds. In 13 bird families known to contain heterothermic species, the common poorwill (Phalaenoptilus nuttallii) is the only species that ostensibly hibernates. We used temperature-sensitive radio-transmitters to collect roost and skin temperature (Tskin) data, and winter roost preferences for free-ranging poorwills in southern Arizona. Further, to determine the effect of passive rewarming on torpor bout duration and {active rewarming}[what-1] (i.e., {the use of metabolic heat to increase Tskin}[how-1]), we experimentally shaded seven {birds}[actor-1] {during winter}[context-1] to prevent them from passively rewarming via solar radiation. {Poorwills}[actor-2] {selected winter roosts that were open to the south or southwest}[how-2], facilitating {passive solar warming}[what-2] {in the late afternoon}[context-2]. Shaded birds actively rewarmed following at least 3 days of continuous torpor. Average torpor bout duration by shaded birds was 122 h and ranged from 91 to 164 h. Active rewarming by shaded birds occurred on significantly warmer days than those when poorwills remained torpid. One shaded bird remained inactive for 45 days, during which it spontaneously rewarmed actively on eight occasions. Our findings show that during winter poorwills exhibit physiological patterns and active rewarming similar to hibernating mammals.
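If we settle on this inline `{span}[label]` annotation style, a small parser can turn the annotated abstracts into (label, span) training pairs. A sketch handling flat spans only (nested spans, like in the next example, would need a proper bracket-matching pass):

```python
import re

# Matches {span text}[label], with no nested braces inside the span.
SPAN = re.compile(r"\{([^{}]+)\}\[([\w-]+)\]")

def parse_spans(annotated):
    """Extract flat {span}[label] annotations as (label, span) pairs."""
    return [(label, span) for span, label in SPAN.findall(annotated)]

example = ("We experimentally shaded seven {birds}[actor-1] {during winter}[context-1] "
           "to prevent passive rewarming.")
assert parse_spans(example) == [("actor-1", "birds"), ("context-1", "during winter")]
```

The label suffixes (-1, -2) also group each span into its actor/what/how/context set, so grouping falls out of the parse for free.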
Another example. This one includes a sub-span for context within 'what'. Eventually we may want to break down these spans further into sub-spans. For example, one 'what' into two: 'regulate body temperature', and 'regulate brain temperature'. One 'how' into three: 'panting through the nose', 'panting through the mouth', and 'selective brain cooling'.
Reindeer (Rangifer tarandus) are protected against the Arctic winter cold by thick fur of prime insulating capacity and hence have few avenues of heat loss during work. We have investigated how these animals regulate brain temperature under heavy heat loads. Animals were instrumented for measurements of blood flow, tissue temperatures and respiratory frequency (f) under full anaesthesia, whereas measurements were also made in fully conscious animals while in a climatic chamber or running on a treadmill. At rest, brain temperature (Tbrain) rose from 38.5±0.1°C at 10°C to 39.5±0.2°C at 50°C, while f increased from ×7 to ×250 breaths min–1, with a change to open-mouth panting (OMP) at Tbrain 39.0±0.1°C, and carotid and sublingual arterial flows increased by 160% and 500%, respectively. OMP caused jugular venous and carotid arterial temperatures to drop, presumably owing to a much increased respiratory evaporative heat loss. Angular oculi vein (AOV) flow was negligible until Tbrain reached 38.9±0.1°C, but it increased to 0.81 ml min–1 kg–1 at Tbrain 39.2±0.2°C. Bilateral occlusion of both AOVs induced OMP and a rise in Tbrain and f at Tbrain >38.8°C. We propose that {reindeer}[actor] {regulate body and, particularly, brain temperature {under heavy heat loads}[context]}[what] by {a combination of panting, at first through the nose, but later, when the heat load and the minute volume requirements increase due to exercise, primarily through the mouth and that they eventually resort to selective brain cooling}[how].
The goal is to identify and summarize biological strategies found in research papers or other datasources, then abstract them into design strategies.
A biological strategy is a characteristic, mechanism, or process that an organism or ecosystem exhibits to accomplish a particular purpose or function within a particular context or conditions.
The main elements of a biological strategy are:
An example biological strategy:
Strategy: The harbor seal’s whiskers possess a specialized undulated surface structure that reduces vortex-induced vibrations as the whiskers move through water.
Organism: harbor seal
Part of: Whiskers
Function: reduce vibrations
Mechanism: a specialized undulated surface structure
Context: Moving through water
A bio-inspired design strategy is a statement that articulates the function, mechanism, and context without using biological terms. Instead, biological terms are replaced with discipline-neutral synonyms (e.g. replace “fur” with “fibers,” or “skin” with “membrane”).
An example design strategy:
Strategy: While moving through a liquid, a small diameter fiber with an undulated surface structure reduces vortex-induced vibrations.
Inputs: raw text (such as the title and abstract of a biology journal article) #13
Outputs: Biological strategy, design strategy, Organism, Part of, Function, Mechanism, Context.
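One way to pin down that output contract is a small schema; the field names below mirror the elements listed above (the class itself is just a sketch, not committed code), populated here with the harbor seal example:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ExtractionResult:
    """Pipeline output schema; field names mirror the strategy elements."""
    biological_strategy: str
    design_strategy: Optional[str]
    organism: str
    part_of: Optional[str]
    function: str
    mechanism: str
    context: str

seal = ExtractionResult(
    biological_strategy=("The harbor seal's whiskers possess a specialized undulated "
                         "surface structure that reduces vortex-induced vibrations as "
                         "the whiskers move through water."),
    design_strategy=("While moving through a liquid, a small diameter fiber with an "
                     "undulated surface structure reduces vortex-induced vibrations."),
    organism="harbor seal",
    part_of="whiskers",
    function="reduce vibrations",
    mechanism="a specialized undulated surface structure",
    context="moving through water",
)
assert seal.organism == "harbor seal"
```

`design_strategy` and `part_of` are optional since not every source text will support them.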
Early evaluation of large language models such as GPT-3 Davinci has shown promise for generating these outputs given a proper prompt and a few (1-3) training examples.
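The few-shot prompts we tried were along these lines; this sketch reconstructs the general shape (the exact instruction wording and field order here are illustrative, not the tested prompt):

```python
def build_prompt(examples, abstract):
    """Assemble a few-shot extraction prompt for an LLM such as GPT-3 Davinci."""
    parts = ["Extract the biological strategy elements from each abstract.\n"]
    for ex in examples:
        parts.append(
            f"Abstract: {ex['abstract']}\n"
            f"Organism: {ex['organism']}\n"
            f"Function: {ex['function']}\n"
            f"Mechanism: {ex['mechanism']}\n"
            f"Context: {ex['context']}\n"
        )
    # End with the new abstract and the first field label for the model to complete.
    parts.append(f"Abstract: {abstract}\nOrganism:")
    return "\n".join(parts)

seal = {
    "abstract": ("The harbor seal's whiskers possess a specialized undulated "
                 "surface structure that reduces vortex-induced vibrations as "
                 "the whiskers move through water."),
    "organism": "harbor seal",
    "function": "reduce vibrations",
    "mechanism": "a specialized undulated surface structure",
    "context": "moving through water",
}
prompt = build_prompt([seal], "Reindeer regulate brain temperature by panting.")
assert prompt.endswith("Organism:")
```

With 1-3 such examples, the model completes the Organism/Function/Mechanism/Context fields for the new abstract.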
AskNature has manually curated a list of biological strategies on its website, based on research papers. These may be helpful in training a machine learning model. #12
Open source large language models
Commercial large language models
Alternatives to large language models for text summarization (may not work as well) https://aws.amazon.com/blogs/machine-learning/part-1-set-up-a-text-summarization-project-with-hugging-face-transformers/ https://www.projectpro.io/article/transformers-bart-model-explained/553#mcetoc_1fq07mh0qa https://paperswithcode.com/sota/text-summarization-on-gigaword