nasa-petal / PeTaL-labeller

The PeTaL labeler labels journal articles with biomimicry functions.
https://petal-labeller.readthedocs.io/en/latest/

match, migration, mag, mesh #67

Closed: elkong closed this issue 3 years ago

elkong commented 3 years ago

Summary

Migrated notebook code to .py source files in auto-labeler/MATCH/py. Also finally found a fix for the issue where MATCH was ignoring MAG and MeSH terms!

Related Issues

Backwards incompatibilities

None.

New Dependencies

None.

elkong commented 3 years ago

@bruffridge No; they're in MATCH/joint/run.sh and MATCH/joint/Preprocess_PeTaL.py, which are in the MATCH.tar.gz tarball on the PeTaL drive. @pjuangph and I talked about this and agreed it's best if I refactor everything so that the code for our modified MATCH itself lives in the repository, leaving just the data on the PeTaL drive.

bruffridge commented 3 years ago

@elkong So adding MAG and MeSH to metadata-aware embedding pre-training fixed the bug? The author of MATCH didn't think doing that would make a noticeable difference. https://github.com/yuzhimanhua/MATCH/issues/3#issuecomment-859261566

elkong commented 3 years ago

@bruffridge Well, I fixed the bug in the sense that MATCH can now successfully leverage the signal from MAG and MeSH terms (see the ablation studies with only MAG / MeSH terms) -- previously it was completely ignoring them.

But when you add them to everything else that is already there, it doesn't really make a noticeable difference (see these ablation studies with the rest of the metadata), unless we're playing the Squeeze One or Two Percentage Points More Performance Out Of MATCH game.

bruffridge commented 3 years ago

@elkong Where are these files? Did you have to manually modify them to fix the bug?

> after modifying emb_init.npy, vocab.npy

elkong commented 3 years ago

@bruffridge Ah, I probably ought to make this clearer. So here's the pipeline for MAG and MeSH terms (and other metadata), with a sketch of step 3's skip behaviour after the list:

  1. they start in the raw data file (here, that's cleaned_lens_output.json, but later golden.json)
  2. embedding pretraining takes them and produces the file PeTaL.joint.emb in auto-labeler/MATCH/src/MATCH/PeTaL/
  3. the preprocessing step takes PeTaL.joint.emb and cleaned_lens_output.json and produces the rest of the files in auto-labeler/MATCH/src/MATCH/PeTaL/, including emb_init.npy (which contains initial embeddings for every token in the text) and vocab.npy (which contains the vocabulary of tokens in the text). If the preprocessing script sees these files already exist, it skips this part.
  4. MATCH uses emb_init.npy and vocab.npy and all the other files in training and evaluation.
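Here's that sketch (the directory path is the real one from the list; the function name and logic are my paraphrase of what the preprocessing script does, not its actual code):

```python
import os

PETAL_DIR = "auto-labeler/MATCH/src/MATCH/PeTaL"

def needs_preprocessing(petal_dir=PETAL_DIR):
    """Return True unless the step-3 outputs already exist."""
    outputs = ("emb_init.npy", "vocab.npy")
    return not all(os.path.exists(os.path.join(petal_dir, f)) for f in outputs)

if needs_preprocessing():
    print("Step 3: regenerating emb_init.npy, vocab.npy, and friends.")
else:
    # The pitfall: stale outputs from an earlier run are silently reused,
    # so new MAG/MeSH tokens never reach training in step 4.
    print("Step 3 skipped: outputs already exist.")
```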

My fix was in step 2 -- I had to modify Preprocess.py (now Preprocess_PeTaL.py) and run.sh (and also, long ago, the Makefile) in auto-labeler/MATCH/src/MATCH/joint/.

But then, after some training, I realised I also had to delete the emb_init.npy and vocab.npy from the original run and rerun the step 3 preprocessing so they'd be regenerated correctly, getting the new MAG and MeSH signals through to training in step 4.
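In other words, the manual fix boiled down to something like this (a sketch of the equivalent steps, assuming the paths above; I actually did this by hand rather than with a script):

```python
import os

PETAL_DIR = "auto-labeler/MATCH/src/MATCH/PeTaL"

# Delete the stale step-3 outputs so the next preprocessing run
# regenerates them from the new PeTaL.joint.emb, which now carries
# the MAG and MeSH terms.
for stale in ("emb_init.npy", "vocab.npy"):
    path = os.path.join(PETAL_DIR, stale)
    if os.path.exists(path):
        os.remove(path)
        print(f"removed stale {path}")

# Then rerun the step-3 preprocessing as usual.
```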

EDIT: You won't find PeTaL/ in auto-labeler/MATCH/src/MATCH/ on GitHub yet. It's a folder containing lots of our data; it gets downloaded from our Google Drive and appears in auto-labeler/MATCH/src/MATCH/ once you run setup.py per the setup instructions.

bruffridge commented 3 years ago

@elkong OK, now I get it. So I think the real bug was that emb_init.npy and vocab.npy weren't being updated with the MAG and MeSH terms because they already existed. The author of MATCH said that in his studies embedding pretraining didn't accomplish much, so updating Preprocess.py and run.sh could be skipped. Not that it hurts anything, though.

elkong commented 3 years ago

@bruffridge Sorry, no; what I consider the real bug (the one I spent two weeks or so looking for) was in Preprocess.py and run.sh -- it was what prevented MAG and MeSH terms from entering the vocabulary that MATCH could consider in the first place.

It turns out that every time you update your dataset file with novel tokens, the only way that those tokens are going to find their way into vocab.npy is by running Preprocess.py and run.sh again to generate PeTaL.joint.emb. (Then you're going to have to run the step 3 preprocessing again to generate a fresh vocab.npy et al., but that part only took me a few hours to realise.)
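If you want to check that the regeneration worked, one quick sanity test (a sketch -- the example token spellings below are hypothetical, so substitute MAG/MeSH terms that actually appear in your raw data file) is to load vocab.npy and look for the metadata tokens directly:

```python
import numpy as np

# Real path from the pipeline above.
vocab = np.load("auto-labeler/MATCH/src/MATCH/PeTaL/vocab.npy", allow_pickle=True)
tokens = set(vocab.tolist())

# Hypothetical example tokens; use ones from your own dataset.
for token in ["MAG_example_field", "MESH_example_descriptor"]:
    print(token, "present" if token in tokens else "MISSING")
```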

That being said, if you know you're not adding any more MAG or MeSH terms, then you can probably skip running embedding pretraining.