rafelafrance / nitfix

Phylogenomic discovery and engineering of nitrogen fixation into the bioenergy woody crop poplar
MIT License

As a developer I want this ecosystem as a containerized image(s) #10

Closed · mjy closed this issue 3 years ago

mjy commented 3 years ago

This is a remarkable demonstration of how to use Git, cutting-edge workflows, and cutting-edge informatics to accomplish a remarkable study. It's also nearly impossible to replicate unless you are @rafelafrance, or have someone with equivalent skills on the team. Perhaps this is just going to be required for this type of study; that's fine. This doesn't diminish or detract from this exemplary study; it seeks to ask, "what's next?", i.e. how do we industrialize this example? One possibility for moving forward is to encapsulate the environment so that those who are technologically inclined can explore what it looked like on @rafelafrance's machine throughout the study.

rafelafrance commented 3 years ago

You raise good points. Yes, this repository is a pile of ad hoc-ery, but there are a few ideas in here worth mining. To do it will require a little effort on your part. I can see from your GitHub profile that you're busy so I'll let you decide if you have the time.

A proposal: I can create a branch with a snapshot of downloaded data. I think that all of the data is now available to the public but I'll have to ask to be sure. For security (and other) reasons, I cannot give access to the Google sheets themselves.

I'd be on the hook for:

You'd be on the hook for:

mjy commented 3 years ago

Thanks for the feedback, and for calling my bluff.

I've spent a little more time looking at what you have in the current repo. Some counter-offers and ideas:

Thanks for taking the time to engage.

rafelafrance commented 3 years ago
mjy commented 3 years ago

Great. A row or two of random data per model might help too.

4K images in 2 days? Hmm... we can do better. When I scan for QR-codes (~16,900 images) on my laptop it's completed in well under 4 hours.

I should clarify. This assumes that you had the images in your possession; these were not in our possession, i.e. immediately after the curator takes them they can drag-and-drop (sets of) them and be done, as the images are "databased" on the spot. So, essentially, zero developer time (infinitely fast!) for a step that produces data that shouldn't require a developer. You don't produce 15k images instantly, your team does, and not fast enough to require more than a couple of minutes of archiving at the end of each session (i.e. we don't have the issue of 4k backlogged images anymore; that was a curator hard-testing the system between the bulk loading and the UI being "complete"). We also ran batch loaders in the same process to catch a backlog of some 100k images, not as fast as your time, but also not a human bottleneck.
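For anyone trying to reproduce the speed comparison, a minimal sketch of the kind of batch QR-code scan being discussed is below. This is not the nitfix implementation; it assumes the pyzbar and Pillow libraries and a hypothetical "images/" directory of JPEGs.

```python
# Minimal sketch of a batch QR-code scan over sheet images.
# Not the nitfix implementation; assumes the pyzbar and Pillow
# libraries and a hypothetical "images/" directory of JPEGs.
from pathlib import Path

from PIL import Image
from pyzbar.pyzbar import decode


def scan_images(image_dir: Path) -> dict:
    """Map each image file name to the first QR-code payload found in it."""
    results = {}
    for path in sorted(image_dir.glob("*.jpg")):
        with Image.open(path) as img:
            codes = decode(img)  # list of Decoded objects from pyzbar
        results[path.name] = codes[0].data.decode("utf-8") if codes else ""
    return results


if __name__ == "__main__":
    for name, payload in scan_images(Path("images")).items():
        print(name, payload or "<no QR code found>")
```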

rafelafrance commented 3 years ago

https://github.com/rafelafrance/nitfix/blob/master/docs/database_schema.md

mjy commented 3 years ago

Great, thanks. Special thanks for the annotations and for being forthcoming about what happened in various places; super useful for future considerations.

I think my next step is to try to get a feel for which parts of the schema were touched by humans at specific points in the workflow. For example, how did curators fill out the plate metadata that was sent to sequencing, or was this assigned by the system? This would let me try to write up a mirrored story in the TW world and highlight the gaps.

mjy commented 3 years ago

Question:

I try to audit for data problems and store the results in error tables.

In this workflow was it useful to go back and forward in time (e.g. git checkout ...) to look at past versions of these error tables, or did you regenerate them as needed and just look at the current reports? I'm trying to tease out whether (regenerated) SQL-based views on the data are enough here.
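To make the question concrete, the sketch below is roughly what I mean by a regenerated view: drop it and rebuild it from the current data on every run, rather than versioning error tables. The table and column names are invented for illustration and are not the nitfix schema.

```python
# Sketch of the "regenerated SQL view" idea: rebuild the error report from
# current data instead of keeping versioned error tables. Table and column
# names are invented for illustration; they are not the nitfix schema.
import sqlite3

con = sqlite3.connect("example.sqlite")
con.executescript("""
    CREATE TABLE IF NOT EXISTS samples (
        sample_id TEXT PRIMARY KEY,
        plate_id  TEXT,
        well      TEXT,
        qr_code   TEXT
    );

    -- Dropped and recreated on every run, so it always reflects current data.
    DROP VIEW IF EXISTS sample_errors;
    CREATE VIEW sample_errors AS
        SELECT sample_id, 'missing QR code' AS problem
          FROM samples
         WHERE qr_code IS NULL OR qr_code = ''
        UNION ALL
        SELECT sample_id, 'duplicate well' AS problem
          FROM samples
         GROUP BY plate_id, well
        HAVING COUNT(*) > 1;
""")

for sample_id, problem in con.execute("SELECT sample_id, problem FROM sample_errors"):
    print(sample_id, problem)
```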

rafelafrance commented 3 years ago

In this workflow was it useful to go back and forward in time (e.g. git checkout ...) to look at past versions of these error tables,

If I skipped the image scan, which the code does automatically if there are no new images, I just re-ran the entire pipeline. It took minutes at most. Just make sure you ingest the manual or other correction data with the rebuild. For this project... I don't think you need to go back in time; you should be able to pile on. But as a general rule... I don't know.
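In outline, that rebuild-with-corrections pattern looks something like the sketch below. The step names are invented stubs for illustration; the real steps live in this repository's pipeline code.

```python
# Schematic of the "rebuild everything, then reapply corrections" ordering
# described above. Step names are invented stubs; the real steps live in
# this repository's pipeline code.
def scan_images_for_qr_codes():
    print("scanning images for QR codes (the slow step)")

def ingest_sheet_snapshots():
    print("loading Google Sheet snapshots")

def ingest_manual_corrections():
    print("applying manual corrections")

def regenerate_error_reports():
    print("rebuilding error reports/views from current data")


def rebuild(new_images: bool = False) -> None:
    """Rebuild the working database from scratch; minutes, not hours."""
    if new_images:
        scan_images_for_qr_codes()
    ingest_sheet_snapshots()
    ingest_manual_corrections()   # corrections ride along on every rebuild
    regenerate_error_reports()


if __name__ == "__main__":
    rebuild()
```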

[paraphrasing] What can you control with the humans in the loop?

I think that if you create a web app you should be in a better position than I was. I predicted some of the issues, but the error checking was turned off in the Google sheets. On the whole we tried to limit interaction with the DB to scanning QR-codes and filling in templates and reading reports.

On the other hand, the lab work was completely out of my hands and I was completely blind-sided by the untracked replating. If asked, I would have worked out something easy-ish to use. Lab techs scanned QR-codes to fill in a Google sheet template for the 96-well plates. That part worked better than expected once I worked out some really odd input choices. But after that, I was out of the loop and got Google sheets back.
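A rough sketch of the kind of up-front check that could have caught those odd inputs is below. The CSV layout and column names are invented for illustration; they are not the actual template the lab used.

```python
# Rough sketch of validating one exported 96-well plate sheet before loading.
# The CSV layout and column names ("well", "qr_code") are invented for
# illustration; they are not the actual template.
import csv
import string

ROWS = string.ascii_uppercase[:8]                        # A-H
WELLS = {f"{r}{c}" for r in ROWS for c in range(1, 13)}  # 96 wells


def check_plate(csv_path: str) -> list:
    """Return a list of human-readable problems found in one plate export."""
    problems, seen = [], set()
    with open(csv_path, newline="") as handle:
        for line_no, row in enumerate(csv.DictReader(handle), start=2):
            well = (row.get("well") or "").strip()
            qr = (row.get("qr_code") or "").strip()
            if well not in WELLS:
                problems.append(f"line {line_no}: bad well label {well!r}")
            elif well in seen:
                problems.append(f"line {line_no}: well {well} filled twice")
            if not qr:
                problems.append(f"line {line_no}: empty QR code")
            seen.add(well)
    return problems


if __name__ == "__main__":
    for problem in check_plate("plate_export.csv"):
        print(problem)
```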

mjy commented 3 years ago

On the whole we tried to limit interaction with the DB to scanning QR-codes and filling in templates and reading reports.

Super useful insight.

rafelafrance commented 3 years ago

You may open another request if there are other issues.