As a developer I want this ecosystem as a containerized image(s)

mjy commented 3 years ago

This is a remarkable demonstration of how to use Git, cutting edge workflows, and cutting edge informatics to accomplish a remarkable study. It's also nearly impossible to replicate unless you are @rafelafrance, or have someone with equivalent skills on the team. Perhaps this is just going to be required for this type of study, that's fine. This doesn't diminish or detract from this exemplary study, it seeks to ask, "what's next?", i.e. how to we industrialize this example? One possibility to move forward is to encapuslate the environment so that those technologically inclined can explore what it looked like on @rafelafrance machine throughout the study.

rafelafrance commented 3 years ago

You raise good points. Yes, this repository is a pile of ad hoc-ery, but there are a few ideas in here worth mining. To do it will require a little effort on your part. I can see from your GitHub profile that you're busy so I'll let you decide if you have the time.

A proposal: I can create a branch with a snapshot of downloaded data. I think that all of the data is now available to the public but I'll have to ask to be sure. For security (and other) reasons, I cannot give access to the Google sheets themselves.

I'd be on the hook for:

Creating a branch so that you can play with the data
Getting a snapshot of the data. Well, for the images it'd have to be a subset of the data.
- I'd have to mock a some of the data relating to the missing images.
Making minor changes that uses this snapshot instead of downloading the Google sheets.
- I would make sure that all changes are well marked.
Helping you with questions.
Fixing some minor bugs.
I should probably provide a post mortem of the code.

You'd be on the hook for:

Trying to run the code and letting me know what you need.
Optionally, creating your own worksheets to explore the Google worksheet code or the complete download process.

mjy commented 3 years ago

Thanks for the feedback, and calling my bluff.

I've spent a little more time looking at what you have in the current repo. Some counter offers, ideas:

I'm interested the schema that you mention should come out of this, particularly light-weight semantics that we could employ in TaxonWorks or elsewhere. I'd at minimum commit to spending some time understanding your core data/concepts if they were summarized, even as tables, in a single place.
I'm increasingly highly leery of any science that depends on Google (just this week I was subject to one of their algorithmic cutoffs for the first time, no email coming in for 2 days). We need alternates. If they decide to cut you off, for "reasons", you're screwed, at least during the data gathering process.
And of course my true colors come through- I think it would be relatively straightforward (of course, time is the bottleneck) to build out the workflow you are using in a more traditional web-app. TaxonWorks has fantastic image ingestion, a curator loaded 4k images in 2 days in a format near identical to what you are requiring here, auto-stubbing specimens, etc. GIven its specimen, labeling, nomenclature etc. tools I think, as far as I can tell, the main missing concept is what you'd call "Sample". Once in place we'd have a very rich set of code/semantics/filters etc. with a robust unit-test framework etc., containerization (everything offline-able via dumps/docker) etc. I'd love to get what you've done and feed it to others, who are requesting tools for their workflows as I type, e.g. we're building out a precursor "Extract"/quantification module as we speak, though this might be part of this workflow.

Thanks for taking the time to engage.

rafelafrance commented 3 years ago

I can provide a simple schema with some narrative on the important fields. This will take a few days.
I completely agree about Google, however, I am not in control of where the data comes from.
There is plenty of low-hanging fruit for improving the system.
- A web app is an excellent idea as are unit tests & containerization. Data audits are also very important.
- 4K images in 2 days? Hmm... we can do better. When I scan for QR-codes (~16,900 images) on my laptop it's completed in well under 4 hours.
- A sample is the part of the plant that is cut off from the museum specimen and put into the envelope with the QR-code as shown in the README. That and taxonomies are two of main concepts of the system. I'll make sure to note them in the explanation of the DB schema.

mjy commented 3 years ago

Great. A row or two of random data per model might help too.

4K images in 2 days? Hmm... we can do better. When I scan for QR-codes (~16,900 images) on my laptop it's completed in well under 4 hours.

I should clarify. This assumes that you had the images in your possesion, these were not in our possession, i.e. immediately after the curator takes them they can drag (sets) drop, and be done as the images are "databased". So, essentially, 0 developer time (infinitely fast!) for a step that produces data that shouldn't require a developer. You don't produce 15k images instantly, your team does, and not fast enough to require more than a couple minutes of archiving at the end of each session (i.e. we don't have the issue of 4k backlogged images anymore, that was a curator hard-testing the system b/w the bulk loading and the UI being "complete"). We also ran batch loaders in the same process to catch a backlog of some 100k images, not as fast as your time, but also not a human bottleneck.

[Edit] We do use tesseract to read the catalog numbers as part of that ingenstion. We can definitely do better/be more rubust adopting your QR code approach!
So in TW Sample is an AnatomicalPart originating_from some CollectionObject with Identifier annotations and a reference to a Container if needed. We're only missing AnatomicalPart, and could proxy with CollectionObject right away.
But I think there is more. The data attributes don't really describe this concept of Sample from I can tell.

rafelafrance commented 3 years ago

https://github.com/rafelafrance/nitfix/blob/master/docs/database_schema.md

mjy commented 3 years ago

Great, thanks. Special thanks for the annotations and being forthcoming on what happened in various places- super useful for future considerations.

I think my next step is to try and get a feel for what parts of the schema were touched by humans at specific points in the workflow. For example, how did curators fill out the plate metadata that was sent to sequencing, or was this assigned by the system. This would let me try write a mirrored story in TW world written up, and to try and highlight the gaps.

mjy commented 3 years ago

Question:

I try to audit for data problems and store the results in error tables.

In this workflow was it useful to go back and forward in time (e.g. git checkout ...) to look at past versions of these error tables, or did you regenerate them as needed and just look at the current reports? I'm trying to tease out whether (regenerated) SQL-based views on the data are enough here.

rafelafrance commented 3 years ago

In this workflow was it useful to go back and forward in time (e.g. git checkout ...) to look at past versions of these error tables,

If I skipped the image scan, which the code will do automatically if there are no new images, I just ran the entire pipeline. It took minutes at most. Just make sure you ingest the manual or other correction data with the rebuild. For this project... I don't think you need to go back in time you should be able to pile on. But as a general rule... I don't know.

[paraphrasing] What can you control with the humans in the loop?

I think that if you create a web app you should be in a better position than I was. Some of the issues I predicted but the error checking was turned off in the Google sheets. On the whole we tried to limit interaction with the DB to scanning QR-codes and filling in templates and reading reports.

On the other hand, the lab work was completely out of my hands and I was completely bind-sided by the untracked replating. If asked I would have worked out something easy-ish to use. Lab techs scanned QR-codes to fill in a Google sheet template for the 96-well plates. That part worked better than expected one once I worked out some really odd input choices. But after that, I was out of the loop and got Google sheets back.

mjy commented 3 years ago

On the whole we tried to limit interaction with the DB to scanning QR-codes and filling in templates and reading reports.

Super useful insight.

rafelafrance commented 3 years ago

You may open another request if there are other issues.

rafelafrance / nitfix

As a developer I want this ecosystem as a containerized image(s) #10