lingpy / pybor

A Python library for borrowing detection based on lexical language models
Apache License 2.0

Resolve memory leak, progress updates, sparse-crossentropy #33

Open fractaldragonflies opened 4 years ago

fractaldragonflies commented 4 years ago

Changes I made after the initial submission:

- Resolved the memory leak by clearing the TensorFlow workspace after each neural run. This needs to be called from each example that runs multiple languages, i.e., cross-validation and simulated borrowings.
- Kept some changes to file output that were initially thought to be related to the memory problem; they remained useful even after the workspace-clearing fix was adopted.
- Added progress updates for the cross-validation and fake-borrowing examples, since the neural network takes substantial time when run over the entire WOLD corpus.
- Dropped the 'oversampling' function, as it was not useful and never documented.
- Changed to a sparse representation and the corresponding crossentropy for segments, with no impact on results.

No new functionality, other than the progress updates.
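For readers following along, the per-run cleanup pattern described above can be sketched as follows. This is a minimal illustration, not pybor's actual API: the function names `run_language` and `cross_validate` are hypothetical stand-ins for the cross-validation and simulated-borrowing examples, and the TensorFlow import is guarded so the sketch also runs without TensorFlow installed.

```python
import gc

def run_language(language):
    """Placeholder for one neural train/evaluate run (illustrative only)."""
    return {"language": language, "entropy": 0.0}

def cross_validate(languages):
    """Run each language in turn, clearing TensorFlow state between runs."""
    results = {}
    for i, language in enumerate(languages, start=1):
        results[language] = run_language(language)
        # Clear the TensorFlow graph/session so memory held by the previous
        # model is released between runs -- this is the memory-leak fix.
        try:
            from tensorflow.keras import backend as K
            K.clear_session()
        except ImportError:
            pass  # sketch only: pattern is illustrated even without TF
        gc.collect()
        # Progress update, since a full WOLD run takes substantial time.
        print(f"[{i}/{len(languages)}] {language} done")
    return results
```

The key point is that the cleanup call sits inside the loop over languages, so each neural run starts from a fresh graph.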

LinguList commented 4 years ago

@tresoldi, I trust you in looking into this, but let me know if you want me to have another look.

fractaldragonflies commented 4 years ago

I resolved the conflicts in cross_validate_models, but when I exited without pulling the merge, the resolutions were discarded. I suppose resolving and merging are meant to be done together. So I'll wait for your review, and if it's OK to merge, I'll resolve the conflicts at that time.

fractaldragonflies commented 4 years ago

The entropies.py changes and the supporting changes in neural.py (the NeuralData and Generator classes) may be more aggressive than we want to submit with the final paper. They include the fix for the memory leak and the progress updates, which are the really important problems to solve, but they also include work I was doing to 'improve' the model: the switch to sparse crossentropy, plus some related model and fitting changes, while leaving config.py largely unchanged. An alternative to merging this branch would be for me to make a different branch, 'finish-PLOS-paper' or similar, without the sparse-crossentropy changes.
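For context on why the sparse-crossentropy change has "no impact on results": it is only a change of label representation, not of the loss value. Integer class indices with sparse categorical crossentropy yield the same loss as one-hot vectors with categorical crossentropy. A small NumPy check (illustrative only, not pybor code; the probabilities are made up):

```python
import numpy as np

# Predicted segment probabilities for 3 positions over a 4-symbol inventory.
probs = np.array([
    [0.7, 0.1, 0.1, 0.1],
    [0.2, 0.5, 0.2, 0.1],
    [0.1, 0.1, 0.2, 0.6],
])

# Sparse labels: one integer class index per position.
sparse_labels = np.array([0, 1, 3])

# Equivalent one-hot labels.
one_hot = np.eye(4)[sparse_labels]

# Categorical crossentropy with one-hot targets.
cce = -np.mean(np.sum(one_hot * np.log(probs), axis=1))

# Sparse categorical crossentropy: index the true-class probability directly.
scce = -np.mean(np.log(probs[np.arange(len(sparse_labels)), sparse_labels]))

assert np.isclose(cce, scce)
print(round(float(cce), 4))  # -> 0.5202
```

The sparse form avoids materializing one-hot arrays over the whole segment inventory, which saves memory without changing the fitted model.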

Not sure of the protocol here. Tiago, we can discuss and decide whether to abandon this merge request.

tresoldi commented 4 years ago

@fractaldragonflies I agree, the changes might be too aggressive, even though they are good and desirable.

There are many alternatives here, in terms of branches and releases. The two best are, in my opinion:

I would prefer the first alternative, but the second is easier, quicker, and less bound to raise problems. Thus, my vote is to keep these changes "reserved", for after the submission.

tresoldi commented 4 years ago

BTW, my suggestion implies making sure the john-post-plos branch is in sync with master. I have just fixed the three simple conflicts that were there.

LinguList commented 4 years ago

I would explicitly ask that no new code with new features be added before we have resubmitted the publication. Let's stick with this for now, so we can release 1.0 ASAP and reference 1.0 in the publication.

However, please note that the reviewer asked us to officially publish the German list which I created, so this list should also be lexibankified, and it should be mentioned as the source for the repo.

So I suggest: we need to do a thorough code review next week to see how all of this looks, and maybe wait with the release until I find time to either lexibankify my list or submit it to IDS.

New code could then stay in this other branch.

tresoldi commented 4 years ago

Do you have the original list at hand? A proper lexibank dataset would need concept mapping and values, which are missing from the raw Python table: https://github.com/lingpy/pybor/blob/master/src/pybor/dev/data.py
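To make concrete what "concept mapping and values" means here: a lexibank-style wordlist entry needs, at minimum, the original value as typed and a mapping of each concept to Concepticon, beyond the raw forms in data.py. The sketch below follows common CLDF wordlist column conventions; the IDs, forms, and Concepticon numbers are hypothetical placeholders, not actual data or the final schema:

```tsv
ID	Language_ID	Parameter_ID	Concepticon_ID	Value	Form
german-hand-1	german	hand	XXXX	Hand	h a n t
german-mountain-1	german	mountain	XXXX	Berg	b ɛ r k
```

The Value column preserves the orthographic entry as elicited, while Form holds the segmented transcription that pybor's models consume.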

LinguList commented 4 years ago

Yes, they are missing. I have the list, I mean, I typed it. I already indicated in our response letter that I'll make a lexibank dataset out of it (I assigned this task to myself there). For the library, we can just say that inside pybor we use the list from this lexibank dataset (or IDS contribution).