malariagen / fits

File tracking system for group DK

Decision on making FITS redeployable from scratch #56

Open podpearson opened 5 years ago

podpearson commented 5 years ago

Email from Alistair:

I'm concerned that manual changes are being made to the FITS database, e.g., to correct missing Alfresco study codes, but that those changes are not being captured as code within the FITS repo. I think everything should be captured somehow within the repo, so that someone other than Magnus could come along and redeploy FITS from scratch, using only docs and code within the repo, and get to exactly the state of the database as it is currently.

This issue is to make a decision on whether we will do the above. FWIW I am strongly in favour of doing this. @magnusmanske , perhaps you could give your thoughts on this, maybe including some indication of how much work you think might be involved?

If we do decide to do this, I think the next step would be to create whatever issues are needed to capture all of the work required.

alimanfoo commented 5 years ago

Thanks @podpearson for raising this. Just to reiterate, for me this is essential. For such a critical system, we cannot rely on any one person to know or remember everything. Knowledge needs to be shared, and wherever possible to be captured in code or documentation. From my point of view, this project (FITS) is as much about capturing knowledge about systems and data as it is about creating an actual database.

alimanfoo commented 5 years ago

Btw I am happy with any technical solution, as long as there is a reproducible way to deploy an instance of FITS, given only code and documentation hosted in this repo.

magnusmanske commented 5 years ago

That would, of course, involve me re-doing all of the data work I have done for the last, well, year or so.

I will start a separate database to attempt this, and see how it goes. At least I have a "target" to compare to...

magnusmanske commented 5 years ago

Note that this would likely not be "reproducible by code", unless you want some horrible script mix to do e.g. one-time imports from solaris (which would have to be kept around indefinitely, under your scenario). My aim here would be to re-do imports and updates, but log them in detail this time around.

podpearson commented 5 years ago

Unless I'm mistaken, the only data we need from Solaris is a single import of 6 fields (oxford_code, path, name, manual_qc, study_group and alfresco) from one view (vw_vrpipe). Couldn't we just write that data out as a single file and put that file in github? Then we wouldn't need to keep Solaris around.

Apologies if I've misunderstood the complexities of the imports here - please expand if you think there is much more to it than this. Perhaps a useful starting point might be to simply list the imports you think will need to be included?
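
For illustration, a one-time dump along those lines might look something like this in Python. The view and field names are as above; the host, credentials, and the choice of a MySQL-compatible driver are placeholder assumptions about how Solaris is reached.

```python
# Hypothetical one-time dump of the six vw_vrpipe fields to a TSV that could
# be committed to the repo. Connection details are placeholders; PyMySQL is
# an assumption about how Solaris is reached.
import csv
import pymysql

FIELDS = ["oxford_code", "path", "name", "manual_qc", "study_group", "alfresco"]

conn = pymysql.connect(host="solaris.example.org", user="readonly",
                       password="...", database="solaris")
try:
    with conn.cursor() as cur:
        cur.execute("SELECT {} FROM vw_vrpipe".format(", ".join(FIELDS)))
        rows = cur.fetchall()
finally:
    conn.close()

with open("solaris_vw_vrpipe_dump.tsv", "w", newline="") as out:
    writer = csv.writer(out, delimiter="\t")
    writer.writerow(FIELDS)
    writer.writerows(rows)
```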

alimanfoo commented 5 years ago

Personally I'd be fine with Richard's suggestion, i.e., taking a one-time dump of the necessary data from solaris to a file and putting it in github or git lfs if it's big.

If there is a sequence of imports and queries being run to build the database, surely we could capture that sequence as code, even if it is just a script with a list of commands and queries to run?

I realise there is also a temporal aspect to dealing with data sources other than solaris, where data is changing and we want to have some process for reimporting periodically. But surely there is some way of handling that in a reproducible way too. E.g., there is a script (or whatever) to run that builds the database from scratch on whatever day you choose to run the build on, then there are scripts that apply incremental updates on top of that every day or week or whatever. Or even just rebuild from scratch with a relatively low periodicity; e.g., weekly would be more than enough for vector.
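
As a sketch of that split, the build sequence itself could live in the repo as a short driver script. All of the step-script names below are hypothetical placeholders, not real files in this repo.

```python
# Hypothetical driver capturing the build sequence as code. Every step
# script named here is a placeholder for illustration only.
import subprocess
import sys

FULL_BUILD = [
    "sql/create_schema.sql",
    "imports/import_solaris_dump.py",
    "imports/import_mlwh.py",
    "patches/apply_exceptions.py",
]
INCREMENTAL_UPDATE = [
    "imports/import_mlwh.py",
    "patches/apply_exceptions.py",
]

def run_steps(steps):
    for step in steps:
        print(f"== {step}")
        if step.endswith(".sql"):
            # run SQL steps via the mysql client
            subprocess.run(["mysql", "fits", "-e", f"source {step}"], check=True)
        else:
            subprocess.run([sys.executable, step], check=True)

if __name__ == "__main__":
    run_steps(FULL_BUILD if "--from-scratch" in sys.argv else INCREMENTAL_UPDATE)
```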

magnusmanske commented 5 years ago

Question: Will SIMS be able to do all that? If not, will you require it?

alimanfoo commented 5 years ago

You mean have fully automated build and update processes? I don't know if it will be able to do it, but it should do IMO.

magnusmanske commented 5 years ago

I do have a fully automated update process. Just not one that fixes all the broken data bits since the beginning of time...

alimanfoo commented 5 years ago

> I do have a fully automated update process. Just not one that fixes all the broken data bits since the beginning of time...

Right. I would dearly love to know what all those broken bits are, and how you fixed them.

Then it would be good to figure out what data sources need patching, and what is the sequence of imports and patches you need to run to create an initial build of the database. And when you then run incremental updates on top of that, does every update need some patching, or can you just run a simple update from the changing sources?

magnusmanske commented 5 years ago

I am currently running a fresh import on a new database. I am using full logging on INSERT, UPDATE, and DELETE for all relevant tables. All relevant tables also have mandatory notes (source and date at the very least).

Since the file information comes from iRODS/baton, which is very slow, it will take days to import all the file information. So much for "quick re-deploy"...
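
For illustration, that kind of row-level logging could be done with triggers along these lines. This is only a sketch against a hypothetical `file`/`audit_log` schema, not the actual FITS tables.

```python
# Sketch of row-level audit logging via a MySQL trigger, created from Python.
# The `file` and `audit_log` tables are hypothetical; the real FITS schema
# may differ.
import pymysql

AUDIT_TRIGGER = """
CREATE TRIGGER file_audit_update AFTER UPDATE ON file
FOR EACH ROW
  INSERT INTO audit_log (table_name, row_id, action, note, changed_at)
  VALUES ('file', NEW.id, 'UPDATE', NEW.note, NOW())
"""

conn = pymysql.connect(host="localhost", user="fits",
                       password="...", database="fits")
try:
    with conn.cursor() as cur:
        cur.execute(AUDIT_TRIGGER)
    conn.commit()
finally:
    conn.close()
```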

podpearson commented 5 years ago

@magnusmanske First a couple of questions (which I should probably have asked ages ago, but I didn't want to tread on your toes). Are you sure you need to query iRODS/baton? Is there anything in there that is required that is not in mlwh?

In order to create the Pv4 and Pf6.2 manifests, the only data I required was mlwh and the set of exceptions files (https://github.com/malariagen/SIMS/tree/master/meta/mlwh) that I pointed you to previously. These exceptions files are created by code that is all tied to github issues.

As far as I'm aware, the only "broken bits" in mlwh (together with the code which created them) are the following five.

1) mlwh only holds sequencescape studies, but for all cases except those covered below, the file https://github.com/malariagen/SIMS/blob/master/meta/mlwh/sequencescape_alfresco_study_mappings/sequencescape_alfresco_study_mappings_20181017.txt can be used to map sequencescape study to alfresco study

https://github.com/malariagen/SIMS/blob/master/notebooks/rp7/20180203_mlwh_exceptions_files.ipynb
https://github.com/malariagen/SIMS/blob/master/notebooks/rp7/20180913_update_sequencescape_alfresco_study_mappings_file.ipynb
https://github.com/malariagen/SIMS/blob/master/work/24_resolve_discrepancies_between_1196_and_1044/20181017_update_sequencescape_alfresco_study_mappings_file.ipynb

2) 6 fields (oxford_code, path, name, manual_qc, study_group and alfresco) from solaris.vw_vrpipe should overrule mlwh

https://github.com/malariagen/SIMS/blob/master/notebooks/rp7/20180203_mlwh_exceptions_files.ipynb

3) 6 samples needed moving from 1148-PF-BD-MAUDE to 1198-PF-METF-NOSTEN

https://github.com/malariagen/SIMS/blob/master/work/25_update_study_exceptions_file_from_pf_62_errors/20181024_update_study_exceptions_file_from_pf_62_errors.ipynb

4) Sample RCB06893 should be RCN06893

https://github.com/malariagen/SIMS/blob/master/work/26_update_sample_exceptions_file_for_RCN06893/20181025_update_sample_exceptions_file_for_RCN06893.ipynb

5) 5 samples need moving from 1195-PF-TRAC2-DONDORP to 1180-PF-TRAC2-DONDORP

https://github.com/malariagen/SIMS/blob/master/work/27_update_study_exceptions_file_from_pf_62_1180_1195_errors/20181107_update_study_exceptions_file_from_pf_62_1180_1195_errors.ipynb

The above should include relevant Solaris data and sequencescape-alfresco study mappings for both parasite and vector. Unless you have had other suggested changes from the vector team or the lab team, I can't imagine what other broken data bits there are.
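
For illustration, fixes 1 and 2 could be applied with a few lines of pandas, something like the following. The mlwh extract filename and the join column names are assumptions, and it assumes one row per oxford_code.

```python
# Rough sketch of applying fixes 1) and 2) above in code. The mlwh extract
# and the exact column names are assumptions for illustration.
import pandas as pd

mlwh = pd.read_csv("mlwh_samples.tsv", sep="\t")
mapping = pd.read_csv("sequencescape_alfresco_study_mappings_20181017.txt", sep="\t")
solaris = pd.read_csv("solaris_vw_vrpipe_dump.tsv", sep="\t")

# 1) map sequencescape studies to alfresco studies
merged = mlwh.merge(mapping, on="sequencescape_study", how="left")

# 2) where a sample appears in the Solaris dump, its six fields overrule mlwh
merged = merged.set_index("oxford_code")
overrides = solaris.set_index("oxford_code")
for col in ["path", "name", "manual_qc", "study_group", "alfresco"]:
    merged[col] = overrides[col].combine_first(merged[col])
merged = merged.reset_index()
```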

Obviously the above isn't currently a script that could be rerun to recreate the manifests for Pv4 and Pf6.2, but it would be fairly trivial to create such a script from these. As such, I think my current "working solution" for creating manifests is almost reproducible in the way Alistair suggests above, so I can't see why this couldn't also be the case for FITS.

Please could you create a version of the FITS database, without importing anything from iRODS/baton, using just the above (or even just 1 and 2; I'm less concerned about 3, 4 and 5, as we could easily ask for these to be changed in sequencescape/mlwh)?

Before starting the actual work, could you first create github issues describing each part of the work, using the recommendations at https://github.com/malariagen/Pipelines_issues/blob/100_best_practices_documentation/documents/malariagen_best_practices.md, in particular "Keep work in each branch as small and self-contained as possible. The smaller the better. Big PRs are hard to review."?

Thanks, Richard

magnusmanske commented 5 years ago

" Are you sure you need to query iRODS/baton?" So let's take, for example, the file "5528_5_human.bam". Where, exactly, do you find information on that file in other Sanger systems? I couldn't find it in Subtrack, and MLWH doesn't store file information.

Via iRODS/baton, I got the following (and more) metadata for that file:

alimanfoo commented 5 years ago

FWIW for vector as a minimum we need for each file:

I don't think we need anything else. We can figure out the file type from the path. A file doesn't get an ENA run accession unless it passed manual QC, so as long as we have the run accession, that also implies the QC information, I believe.

I think your message suggests that even for this minimal set, we would need to query both MLWH and iRODS/baton? E.g., to get the library information?

FWIW if we did have to query iRODS/baton but speed is an issue, I would have thought we could assume that data for any given file is unchanging, and build our own cache of the subset of iRODS/baton file data we need.
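
E.g., a minimal cache might look like this, assuming baton-style JSON metadata and that per-file metadata never changes; the exact baton-list invocation is an assumption to check against the baton docs.

```python
# Minimal local cache for iRODS/baton file metadata, assuming metadata for a
# given file never changes once written.
import json
import os
import sqlite3
import subprocess

db = sqlite3.connect("irods_meta_cache.sqlite")
db.execute("CREATE TABLE IF NOT EXISTS meta (path TEXT PRIMARY KEY, json TEXT)")

def query_baton(path):
    # Hypothetical invocation: ask baton for the AVUs of one data object.
    # Check the baton documentation for the exact flags your version supports.
    spec = {"collection": os.path.dirname(path),
            "data_object": os.path.basename(path)}
    out = subprocess.run(["baton-list", "--avu"], input=json.dumps(spec),
                         capture_output=True, text=True, check=True)
    return json.loads(out.stdout)

def get_metadata(path):
    row = db.execute("SELECT json FROM meta WHERE path = ?", (path,)).fetchone()
    if row:
        return json.loads(row[0])   # cache hit: skip the slow iRODS round trip
    meta = query_baton(path)
    db.execute("INSERT INTO meta VALUES (?, ?)", (path, json.dumps(meta)))
    db.commit()
    return meta
```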

podpearson commented 5 years ago

In the specific example of 5528_5_human.bam, we don't have access to this file:

$ iget /seq/5528/5528_5_human.bam
ERROR: getUtil: get error for ./5528_5_human.bam status = -818000 CAT_NO_ACCESS_PERMISSION

Also, I think this file contains the reads that map to the human reference, hence we shouldn't have access to it, and therefore I would argue it shouldn't be in FITS.

Having said this, I guess the more general point is that there could be files that we think we should access that are only accessible via an iRODS/baton query. I have set up a separate issue to start looking at this (https://github.com/malariagen/fits/issues/58), but I feel this can wait until a later date, and that the first priority should be to do the work as outlined in my previous message.

podpearson commented 5 years ago

Regarding the list of metadata you get from iRODS @magnusmanske , I think most if not all of that is also available in mlwh, right? If you think there is metadata that is only available in iRODS, could you let us know what it is so we can make a decision about whether we think it is needed or not?

podpearson commented 5 years ago

Are we all now agreed that we are going to make FITS deployable from scratch? If so, @magnusmanske , could you create new issues to discuss exactly what that means (including e.g. discussions about the scope of this), and then close the current issue?

alimanfoo commented 5 years ago

For my part, if FITS is going to be something more than a stop-gap, then there should be a fully automated process for building it from scratch, as well as automated processes for incremental updates.

That said, I'm getting the sense from other discussions that there's a lot going into FITS that we don't strictly need right now for building FOFNs, at least for vector work, and so there may be a lot of potential for simplification. So I wonder if we should take a step back given the experience we now have with using FITS, see if we can isolate a truly minimal set of functional requirements (e.g., build FOFNs) and non-functional requirements (e.g., fully automated), and then ask Magnus to share some knowledge of what data sources we need to be pulling from to satisfy those requirements, and what the issues and possible solutions would be in terms of automating the build and update processes.

Again, for me, a major purpose of this project is getting knowledge shared from Magnus to other team members. We need to get ourselves on a sustainable footing, and need to be robust to team members coming and going. So we shouldn't be thinking that code and data are the only things that matter here. Happy to spend some time together in the new year when I'm next at Sanger.

magnusmanske commented 5 years ago

@podpearson There is information about the run etc. in MLWH, but nothing about actual files. Some of it might be in Subtrack, but as per my example, some is definitely not.

@alimanfoo I see it from another point of view: if we only go for "what do we need today to create a FOFN", that will lead to a stop-gap measure. We need to do this properly or not at all. By my own, subjective importance criteria:

alimanfoo commented 5 years ago

Hi Magnus,

FWIW wanting to have an automated process to build the database from scratch is nothing to do with convenience or disaster recovery or anything like that. I don't even care if it is a single push-button, or some document that says "run these scripts/queries/... in this order". The point is that it is reproducible, and that the knowledge of how to reproduce it has been captured. For me, "doing it properly" means (1) providing at least the minimal data we (vector and parasite teams) require, and (2) having a system that can be (re)deployed by anyone given code and documentation.

Ultimately FITS is an interface over a number of data sources that hides many of the complexities of how to query and integrate them, so I think it's reasonable to expect that all of the logic is captured somehow. It's cool if you want to add in more than the minimal data, but let's do that after satisfying point (2).

podpearson commented 5 years ago

@magnusmanske

> There is information about the run etc. in MLWH, but nothing about actual files. Some of it might be in Subtrack, but as per my example, some is definitely not.

Sure, I get that mlwh doesn't actually store files, but it does store information on runs, lanes and lanelets from which it is possible to figure out what the files we need are, as I think I've shown in the code I've written to create Pv4 and Pf6.2 manifests (and the fact that I think these manifests are pretty close to what we would have got by using FITS - right?). FWIW note that for those manifests I only pulled data from subtrack if it was in mlwh, so mlwh was really the source data here.
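
For example, the expected lanelet paths can be derived from mlwh run/lane/tag-index rows without touching iRODS. The /seq/{run}/{run}_{lane}#{tag}.{ext} naming pattern shown here is the usual Sanger convention, but treat it as an assumption.

```python
# Hedged illustration: derive expected lanelet paths from mlwh run/lane/tag
# rows. The /seq/{run}/{run}_{lane}#{tag}.{ext} pattern is an assumption.
def lanelet_path(run, lane, tag_index=None, ext="cram"):
    name = f"{run}_{lane}" if tag_index is None else f"{run}_{lane}#{tag_index}"
    return f"/seq/{run}/{name}.{ext}"

print(lanelet_path(5528, 5, ext="bam"))   # /seq/5528/5528_5.bam
print(lanelet_path(5528, 5, 1))           # /seq/5528/5528_5#1.cram
```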

Yes, of course there might be data in iRODS that we need that isn't in mlwh, but I think we could do a one-off exercise to show whether this is the case or not (e.g. #58), and then make the decision of whether we actually need the iRODS data. The example you gave is, I think, not a file that should be in FITS, but if you can find other files that should be in FITS (i.e. contain data on MalariaGEN samples that we have access to), that would probably convince me that we do need to have iRODS/baton as a data source.