The goal is to research a method that will allow us to conduct text extraction and summarization of biological strategies contained in research papers or other data sources.
We want to build a database of biological strategies grouped by function. We will use AskNature's curated list of biological strategies [^1] when training a machine learning model. We also made use of FOBIE and golden.json in our models. golden.json is a curated list of petalai.org biomimicry papers.
"While scanning the water for these hydrodynamic signals at a swimming speed in the order of meters per second, the seal keeps its long and flexible whiskers in an abducted position, largely perpendicular to the swimming direction. Remarkably, the whiskers of harbor seals possess a specialized undulated surface structure, the function of which was, up to now, unknown. Here, we show that this structure effectively changes the vortex street behind the whiskers and reduces the vibrations that would otherwise be induced by the shedding of vortices from the whiskers (vortex-induced vibrations). Using force measurements, flow measurements, and numerical simulations, we find that the dynamic forces on harbor seal whiskers are, by at least an order of magnitude, lower than those on sea lion (Zalophus californianus) whiskers, which do not share the undulated structure. The results are discussed in the light of pinniped sensory biology and potential biomimetic applications".
"['specialized undulated surface structure', 'structure effectively changes', 'potential biomimetic applications', 'pinniped sensory biology', 'meters per second', 'harbor seals possess', 'using force measurements', 'vortex street behind', 'induced vibrations ).', 'harbor seal whiskers'...]"
"Using force measurements, flow measurements and numerical simulations, we find that the dynamic forces on harbor seal whiskers are, by at least an order of magnitude, lower than those on sea lion (Zalophus californianus) whiskers, which do not share the undulated structure."
I combined the outputs of both methods in the last line of code so that we can compare them side by side and understand what each method does. Because we also want text summarization, we take an extraction-based approach: we search the document for its key sentences and phrases and build the summary from them.
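The extraction-based idea can be sketched in a few lines of plain Python. This is a simplified, RAKE-style re-implementation for illustration only, not the code used in this repository and not the API of any RAKE library. The stopword list, the function names (`candidate_phrases`, `rake_scores`, `summarize`) and the shortened demo text are all assumptions made for this sketch.

```python
import re

# Hypothetical, deliberately small stopword list; a real run would use a
# fuller list (for example the NLTK English stopwords).
STOPWORDS = {
    "a", "an", "and", "are", "at", "by", "do", "for", "in", "is", "its", "not",
    "of", "on", "or", "than", "that", "the", "this", "those", "to", "we",
    "which", "with",
}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def candidate_phrases(text):
    """Return runs of non-stopwords, broken at stopwords and punctuation."""
    phrases = []
    for chunk in re.split(r"[^\w\s]+", text.lower()):
        current = []
        for word in chunk.split():
            if word in STOPWORDS:
                if current:
                    phrases.append(tuple(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(tuple(current))
    return phrases


def rake_scores(text):
    """Score each phrase as the sum of degree/frequency over its words."""
    phrases = candidate_phrases(text)
    freq, degree = {}, {}
    for phrase in phrases:
        for word in phrase:
            freq[word] = freq.get(word, 0) + 1
            degree[word] = degree.get(word, 0) + len(phrase)
    return {" ".join(p): sum(degree[w] / freq[w] for w in p) for p in set(phrases)}


def summarize(text, n_sentences=1):
    """Return the n highest-scoring sentences, kept in their original order."""
    scores = rake_scores(text)
    sentences = split_sentences(text)

    def sentence_score(sentence):
        return sum(scores.get(" ".join(p), 0.0) for p in candidate_phrases(sentence))

    ranked = sorted(range(len(sentences)),
                    key=lambda i: sentence_score(sentences[i]), reverse=True)
    return " ".join(sentences[i] for i in sorted(ranked[:n_sentences]))


# Shortened, hypothetical demo text loosely based on the harbor seal abstract.
demo = (
    "The whiskers of harbor seals possess a specialized undulated surface structure. "
    "This structure reduces vibrations. We discuss applications."
)
print(summarize(demo, n_sentences=1))
```

Summing phrase scores favors long sentences, so a variant that divides by sentence length may be worth comparing on real abstracts.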
[UPDATE] The results above are outdated. These are the new text summarization results, which incorporate key features of RAKE: "Using force measurements, flow measurements and numerical simulations, we find that the dynamic forces on harbor seal whiskers are, by at least an order of magnitude, lower than those on sea lion (Zalophus californianus) whiskers, which do not share the undulated structure. Remarkably, the whiskers of harbor seals possess a specialized undulated surface structure, the function of which was, up to now, unknown".
The new summary covers noticeably more of the extracted key phrases than the old one.
The outputs are as expected, but more work is needed to combine the two pieces of code so that the text extraction and summarization of biological strategies produce the intended result. For this example, the expected sentence is:
"A small diameter fiber with an undulated surface structure reduces vibrations caused by drag forces", which belongs to the functions "Move, Active Movement, Actively move through liquid" and "Maintain structural integrity, Manage Structural Forces, Manage Drag/Turbulence".
[UPDATE] The results were as expected. The main goal now is to incorporate this piece into a neural network.
We have yet to test all of AskNature's curated list of biological strategies. We did test this on a couple of petalai.org biomimicry papers, and RAKE was able to output keywords/phrases. Using RAKE components, I obtained the following output for one of the petalai.org abstracts: "The 4 fibrous proteins of honeybee silk are small (∼30 kDa each) and nonrepetitive and adopt a coiled coil structure. Each species produced orthologues of the 4 small fibroin proteins identified in honeybee silk There was extensive sequence divergence among the bee and ant silk genes (<50% similarity between the alignable regions of bee and ant sequences), consistent with constant and equivalent divergence since the bee/ant split (estimated to be 155 Myr). None"
The abstract used from petalai.org was: "Silks are strong protein fibers produced by a broad array of spiders and insects. The vast majority of known silks are large, repetitive proteins assembled into extended β-sheet structures. Honeybees, however, have found a radically different evolutionary solution to the need for a building material. The 4 fibrous proteins of honeybee silk are small (∼30 kDa each) and nonrepetitive and adopt a coiled coil structure. We examined silks from the 3 superfamilies of the Aculeata (Hymenoptera: Apocrita) by infrared spectroscopy and found coiled coil structure in bees (Apoidea) and in ants (Vespoidea) but not in parasitic wasps of the Chrysidoidea. We subsequently identified and sequenced the silk genes of bumblebees, bulldog ants, and weaver ants and compared these with honeybee silk genes. Each species produced orthologues of the 4 small fibroin proteins identified in honeybee silk. Each fibroin contained a continuous predicted coiled coil region of around 210 residues, flanked by 23–160 residue length N- and C-termini. The cores of the coiled coils were unusually rich in alanine. There was extensive sequence divergence among the bee and ant silk genes (<50% similarity between the alignable regions of bee and ant sequences), consistent with constant and equivalent divergence since the bee/ant split (estimated to be 155 Myr). Despite a high background level of sequence diversity, we have identified conserved design elements that we propose are essential to the assembly and function of coiled coil silks."
RAKE extracted these keywords/phrases: "['23 – 160 residue length n', 'continuous predicted coiled coil region', 'ant silk genes (< 50', '4 small fibroin proteins identified', 'small (∼ 30 kda', 'radically different evolutionary solution', 'identified conserved design elements', 'ant sequences ), consistent', 'strong protein fibers produced', 'extensive sequence divergence among', 'found coiled coil structure', '4 fibrous proteins', 'coiled coil structure', 'repetitive proteins assembled', 'species produced orthologues', 'equivalent divergence since', 'coiled coil silks', 'high background level', 'around 210 residues', '155 myr ).', 'honeybee silk genes', 'silk genes', 'subsequently identified', 'ant split', 'fibroin contained', 'coiled coils', 'honeybee silk', 'honeybee silk', 'sequence diversity', 'vast majority', 'unusually rich', 'sheet structures', 'parasitic wasps', 'known silks', 'infrared spectroscopy', 'extended β', 'examined silks', 'building material', 'broad array', 'alignable regions', '3 superfamilies', 'weaver ants', 'bulldog ants', 'found', 'silks', 'ants', 'vespoidea', 'termini', 'spiders', 'similarity', 'sequenced', 'propose', 'nonrepetitive', 'need', 'large', 'insects', 'hymenoptera', 'however', 'honeybees', 'function', 'flanked', 'estimated', 'essential', 'despite', 'cores', 'constant', 'compared', 'chrysidoidea', 'c', 'bumblebees', 'bees', 'bee', 'bee', 'bee', 'assembly', 'apoidea', 'apocrita', 'alanine', 'adopt', 'aculeata']"
We now want to extract key functions from these biomimicry papers. Aspire by AllenAI [^3] provides a similarity model that matches fine-grained aspects of text. I ran their example demo on the sample Harbor Seals abstract and a couple of abstracts from the golden.json file, and these are the results. The code starts by importing the required packages and preparing the example abstracts; it then embeds them and visualizes the optimal transport plans for the computed sentence vectors. The resulting plots are optimal transport plans for the example pairs of abstracts.
The algorithm learns fine-grained document similarity models using co-citations, that is, papers cited together in the same sentence of a research paper. Then the "single-match models are learned from implicit supervision in co-citation contexts" (Mysore, Cohan, Hope 2022). Finally, "multi-match models are learned by aligning aspect representations by solving an Optimal Transport problem" (Mysore, Cohan, Hope 2022). Optimal Transport is a framework for comparing two distributions by finding the cheapest way to move mass from one to the other. The final step here is what we are observing in the plots below. We have two candidates, which are two abstracts with titles, and the Query is a way to see which aspects of each abstract are matched. We see that this "method uses multiple matches with an Optimal Transport mechanism that computes Earth Mover's Distance" (Mysore, Cohan, Hope 2022). This method will help with finding better methods of text representation.
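To make the transport plan concrete, here is a minimal sketch that computes one between two small sets of sentence vectors. It uses Sinkhorn iterations, an entropy-regularized approximation of the Earth Mover's Distance, and is not Aspire's implementation. The 2-D vectors, the Euclidean cost, the uniform sentence weights, and the `eps` and `n_iters` parameters are all assumptions chosen for illustration.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def sinkhorn_plan(query_vecs, cand_vecs, eps=0.1, n_iters=200):
    """Approximate optimal transport between two sets of sentence vectors.

    Every sentence carries equal mass. Returns (plan, total_cost), where
    plan[i][j] is the mass moved from query sentence i to candidate sentence j.
    """
    n, m = len(query_vecs), len(cand_vecs)
    a = [1.0 / n] * n  # mass on each query sentence
    b = [1.0 / m] * m  # mass on each candidate sentence
    cost = [[euclidean(q, c) for c in cand_vecs] for q in query_vecs]
    kernel = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the two marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    plan = [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]
    total_cost = sum(plan[i][j] * cost[i][j] for i in range(n) for j in range(m))
    return plan, total_cost


# Hypothetical toy "sentence embeddings": two query and three candidate sentences.
query = [(0.0, 0.0), (1.0, 0.0)]
candidates = [(0.0, 0.1), (1.0, 0.1), (0.5, 1.0)]

plan, total_cost = sinkhorn_plan(query, candidates)
for row in plan:
    print([round(x, 3) for x in row])
print("approximate transport cost:", round(total_cost, 3))
```

Each row of `plan` shows how one query sentence spreads its mass over the candidate sentences, which is the quantity the heatmap-style plots visualize. With real, high-dimensional embeddings the exponentials can underflow, so a log-domain version or a dedicated optimal transport library would be a safer choice.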
The first plot compares the Harbor Seal petalai.org abstract with the first golden.json abstract. The plots after it compare papers within the golden.json file.
The outputs were as expected in comparison to the demo. Next, we should figure out a way to identify the functions and incorporate them into a summary.
spaCy serves as a backbone for NLP algorithms. We wanted to test how each of its features can be used in an NLP pipeline. The pipeline should include the following:
Feel free to try optimizing this code and to test it on AskNature's curated list of biological strategies. This code is still a work in progress.
If there are any changes you would like to recommend, please create an "Issue" on GitHub (for more information, please refer to the "Where Can Users Get Help" section of this README).
You can create a "New Issue" in the issues section of GitHub. Please refer to the pictures below for the steps on how to create an issue.
[^1]: This is AskNature's curated list of biological strategies: https://asknature.org/biological-strategies/
[^2]: You should be able to download Anaconda for free here: https://www.anaconda.com/products/distribution
[^3]: aspire/allenai: https://github.com/allenai/aspire