neulab / awesome-align

A neural word aligner based on multilingual BERT
https://arxiv.org/abs/2101.08231
BSD 3-Clause "New" or "Revised" License

No caching #11

Closed BramVanroy closed 3 years ago

BramVanroy commented 3 years ago

I am trying to run the script as suggested in the README. I have started it a couple of times, and every time the model needs to be downloaded again and the features need to be recreated. There is no caching at all.

I like this library and its potential (assuming the results are reproducible), but the implementation is not very robust and feels a bit hacky and untested. That is often the case in research projects (mine included), but if the intent is to open-source it and have people use it, it should probably be more robust. I am sure that if you get in touch with the people over at transformers, they can help with better integration into the library and maybe even add the architecture to the library itself! You can tag me there.

BramVanroy commented 3 years ago

It seems that --cache_dir needs to be set manually instead of defaulting to the Transformers cache dir, which is a bit odd. To cache the extracted features, you also need to pass --cache_data.
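For reference, a minimal sketch of an invocation with both caching flags; --cache_dir and --cache_data are the flags discussed here, while the remaining flags and paths follow the README-style example and may differ across versions:

```bash
# Word alignment with caching enabled (paths are placeholders).
# --cache_dir:  where model weights are stored/reused between runs
# --cache_data: also cache the extracted features
CUDA_VISIBLE_DEVICES=0 awesome-align \
    --output_file=output.align \
    --model_name_or_path=bert-base-multilingual-cased \
    --data_file=input.src-tgt \
    --extraction 'softmax' \
    --batch_size 32 \
    --cache_dir ~/.cache/torch/awesome-align \
    --cache_data
```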

zdou0830 commented 3 years ago

Thanks for the suggestion!

I intentionally removed the dependency on transformers because I personally think it is easier to develop the tool this way. After uploading the package to PyPI, I changed the default cache dir to a temporarily created one. Since that has caused confusion, I've now changed the default cache dir to a permanent one (~/.cache/torch/awesome-align, https://github.com/neulab/awesome-align/commit/4c9175b8248690680ed00702e412e8069ba39b20).

Apologies for any inconvenience and contributions are always welcome.

BramVanroy commented 3 years ago

If you can tell me which files you changed (I think only modeling.py but I am not sure about the others) and which ones you copied, I can have a look and integrate it better into transformers in a PR.

Why don't you just use the transformers default cache? Now, if someone uses bert-multilingual once in transformers-based code and once with awesome-align, the model has to be downloaded and saved twice because it ends up in two different locations.
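To make the duplication concrete, a hypothetical illustration; the transformers cache location depends on its version, so treat both paths as examples rather than the exact directories on any given setup:

```bash
# The same mBERT weights can end up stored twice:
ls ~/.cache/huggingface/transformers   # cache used by transformers-based code
ls ~/.cache/torch/awesome-align        # cache used by awesome-align (per the commit above)
```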

zdou0830 commented 3 years ago

Thanks! Yes, I think looking at modeling.py alone is enough, and it'd be great to integrate it better into transformers, as long as this repo doesn't require users to install transformers. (I just hope it can grow independently of transformers, which is why I chose not to use the transformers default cache.)

BramVanroy commented 3 years ago

Why don't you want to use transformers as a dependency, even pinned to a fixed release? That would greatly simplify your library!