Drop attention, replace 'Loan' with 'Borrowed', add possibility to oversample borrowed table.

fractaldragonflies commented 4 years ago

Some cleanup and an experiment.

Dropped Attention... which reduces code in entropies a lot. It was and experiment, and in the end not beneficial. Given the change to data module replacing 'loan' with 'borrowed', replaced throughout the code. Had difficulty processing 'Borrowed' field, so changed the get_access_to_lexibank() to resolve. Please examine to see if OK. Reran tests with use of 'Borrowed'. Problem with donor.py example. So I haven't demonstrated access yet to donor field.

LinguList commented 4 years ago

I have a request to make: Adding the wold-specific functions to data.py does not convince me at all. First, adding new imports half-way of a library is surely not intended, second, it breaks dependencies, as it means: this library is only written for WOLD, while we'd like to keep it open to also use it for other cldf datasets, and third: the specific part with language tables can also be just placed into an "util.py" script in examples or into a wold/data.py module in pybor to make it clearer that this is a very customized application. I can do that easily later, or you could do it before we merge, @fractaldragonflies ?

LinguList commented 4 years ago

BTW: the same applies to the "get_donor_table" function: This won't work if you use another CLDF dataset, so it looses all modularity. I'd say: make a wold-subpackage in pybor, and a subclass for wold that inherits from our generic lexibank class. We may have quite a few different datasets where we can search for borrowed words, so we should not limit ourselves to using WOLD forever.

fractaldragonflies commented 4 years ago

This in a less developed form was one of my waking thoughts this morning. After I share the results on our experiment I’ll look at separating the functions I added for the experiment. While not extensive it was a bit of a hack.

I’ll consult with Tiago about the get donor table. Lexibank isn’t still pretty opaque to me. Hmmmm.

J.E.M.

On Jul 3, 2020, at 5:12 AM, Johann-Mattis List notifications@github.com wrote:

I have a request to make: Adding the wold-specific functions to data.py does not convince me at all. First, adding new imports half-way of a library is surely not intended, second, it breaks dependencies, as it means: this library is only written for WOLD, while we'd like to keep it open to also use it for other cldf datasets, and third: the specific part with language tables can also be just placed into an "util.py" script in examples or into a wold/data.py module in pybor to make it clearer that this is a very customized application. I can do that easily later, or you could do it before we merge, @fractaldragonflies ?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or unsubscribe.

tresoldi commented 4 years ago

Yes, let's do it, we can also Skype if needed. I agree with @LinguList that the library is currently too much tied not only to Lexibank, but to WOLD in particular. While I don't want to pile abstractions, it makes total sense here to have one function for reading plain CSV files, one for CLDF/Lexibank (allowing fields such as donor), and a partial function that deals with WOLD.

fractaldragonflies commented 4 years ago

I would like to take the shortest path to making the 'neural-equal-size’ branch OK to merge.

After that merge, any further refactoring of data is OK by me.
Since it will be depend largely on expertise in the other databases besides WOLD.

So I propose, for an immediate step:

I form a separate module ‘data_donor.py’ to hold the donor specific stuff. Whatever I’ve added to data the mentions donor wold go here. Also subclassing LexibankDataset to add the get_donor_table could go here - at least for now.
I make the changes to rest of my code do separate references from data versus data_donor.py.
Return cross_validate_models_example to its earlier state of NOT referencing donor.
Add cross_validate_donor_models_example … which uses the get_donor_table instead of the get_table.

It’s not the final state, but it would offer enough separation that I think we could merge, before moving on with the changes Tiago proposes.

** This does this work for you???

John Miller millerje@me.com

On Jul 3, 2020, at 7:35 AM, Tiago Tresoldi notifications@github.com wrote:

Yes, let's do it, we can also Skype if needed. I agree with @LinguList https://github.com/LinguList that the library is currently too much tied not only to Lexibank, but to WOLD in particular. While I don't want to pile abstractions, it makes total sense here to have one function for reading plain CSV files, one for CLDF/Lexibank (allowing fields such as donor), and a partial function that deals with WOLD.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/lingpy/pybor/pull/27#issuecomment-653527451, or unsubscribe https://github.com/notifications/unsubscribe-auth/AIVSLTWBW2XXPD4PVPVKADLRZXGCVANCNFSM4OMSCYFQ.

LinguList commented 4 years ago

Yes, I'd just call it "wold.py", not data_donor, but I leave that up to you.

fractaldragonflies commented 4 years ago

Made some of the changes discussed above. Others don't seem necessary now. Once wold specific was separated form Lexibank in general, it was easier to simply use the get_donor_table version for the examples. Only 1 conditional testing donor number == 0.

tresoldi commented 4 years ago

@fractaldragonflies , can we merge this and perhaps discuss how to proceed? I was working on these exact files and was preparing to open my PR, but everything would get out-of-sync.

If we merge now, it is easier to divide what to do once we both agree on a common interface (that it, the structure of data the methods will expect).

fractaldragonflies commented 4 years ago

Yes, please, I would like to merge if you're good with that. I had made the change for data.py and new wold.py in this branch because I didn’t want to wait for approval of the merge to complete them. I pushed the config.py that is the same as in the master branch, except for attention. And data.py, which I had pushed earlier, but now without the massive commented out area.

When you wrote previously that you were not able to duplicate my results:

Besides the possibility of different data, and
different code [which I had hoped was not the case, but has to be ruled out],
there is also the possibility of platform. I’m running analyses on my macbook pro.

Thinking about the PLOS submission, besides the change in format and submission requirements, I/we’ll need to up our game on quality of graphics too I think. Presentation quality counts for more on such venues. Meanwhile, maybe I need to save the files of the entropy calculations for some examples, so that we can create the histograms in a prettier fashion.

tresoldi commented 4 years ago

Great! Sent you an email with more detailed plans and work division. I'll merge here and start organizing the repository/library. Be sure to switch to master in your machine and pull from here.

lingpy / pybor

Drop attention, replace 'Loan' with 'Borrowed', add possibility to oversample borrowed table. #27