greenelab / snorkeling

Extracting biomedical relationships from literature with Snorkel 🏊

Data gen training #23

Closed danich1 closed 6 years ago

danich1 commented 6 years ago

This PR updates the snorkeling repo with the data-labeler and data-gen model notebooks. Together, these two notebooks cover the whole data generation and generative model training section of this project. Unfortunately, this is one of two big repos. (I sacrificed update frequency to get things running.) Let me know what you think, @dhimmel. Got a lot of tacos coming your way after these two repos are done.

danich1 commented 6 years ago

I didn't notice any data files getting added besides hetnet_dg_kb.csv. What are the outputs of the two new notebooks? Is everything getting written to the database, so you're not writing separate output files that we should track?

The output of the first two notebooks is all stored in the database. There is a section (in notebook 2) where a data file is generated, but that file exists only to load the data into the database. I didn't think we needed to track it, since all the data is already in the database and can be accessed through a database dump.
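For anyone following along, here is a rough sketch of that pattern: an intermediate file is loaded into the database, and the database dump becomes the thing to share. This is not the notebook's actual code. It uses Python's built-in `sqlite3` as a stand-in for the project's database, and the `candidate` table, its columns, and the sample rows are all made up for illustration.

```python
import csv
import io
import sqlite3


def load_rows(conn, csv_text):
    """Load rows from CSV text into a hypothetical `candidate` table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS candidate "
        "(disease TEXT, gene TEXT, label INTEGER)"
    )
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = [(r["disease"], r["gene"], int(r["label"])) for r in reader]
    conn.executemany("INSERT INTO candidate VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)


def dump_database(conn):
    """Return the whole database as SQL text, in the spirit of a database dump."""
    return "\n".join(conn.iterdump())


# Contents of the intermediate data file (made-up rows, illustration only).
csv_text = "disease,gene,label\nD1,G1,1\nD2,G2,0\n"

conn = sqlite3.connect(":memory:")  # assumption: SQLite instead of the real DB
n_loaded = load_rows(conn, csv_text)
dump = dump_database(conn)
```

Once the rows are loaded, the intermediate file carries no information that the dump lacks, which is why tracking the dump alone would be enough.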

I can add tracking for the ML models that get saved at each iteration, but I think it is better to add that tracking in the next PR (once the LSTM is done training). If I should add the gen model tracking in this PR, lmk.