Rewrite database description document

magnusmanske commented 6 years ago

based on previous comments from Richard in email 28/06/2018 13:59

podpearson commented 5 years ago

Comments from email mentioned above are as follows:

I think this is a good start. I think what this particularly needs now is an introduction explaining why this is needed, what it is replacing, etc., and also pointers to what comes next. The descriptions of the tables and fields need expanding a bit - I think you should be able to understand every single field in the database using this document. I think we also need, either in this document, or as a separate document, details of the source data for this, how it is populated, etc.

Specifically I would suggest:

Write an introduction which answers the following questions:
- What does this document describe? (an initial prototype of a system to map files to samples?)
- Problem statement - why is the file tracking system needed?
- What does the file tracking system replace?
- What are some known use cases of the file tracking system?
  - Create a build manifest given a set of sequencescape IDs?
  - Create a build manifest given a set of Oxford codes and/or ROMA IDs?
  - Create a build manifest containing all samples from a species?
  - Determine which samples from a given study have been sequenced (Sonia has asked for this a few times recently)
  - Determine how many samples have been sequenced, broken down by species (DK has recently asked for this)
  - Others?
- What are some possible use cases that the file tracking is not intended to cover, but people might assume it would cover? (e.g. any of the above?)
  - If the file tracking system is not intended for this use case, do we know how this use case is expected to be handled in the future?
- Design principles
  - Expand this to full sentences
Database core schema
- Could you make the text a bit bigger?
- Why is there a direct link from storage in file to tag? Why doesn't this go via file2tag? Is storage in some way different to other file properties such as MD5, file size, etc.? If so, need to explain why this is so.
- The SIMS database is using UUIDs as IDs. What do you think about doing the same in the file tracking DB?
Tables
- Could you write a description for each field, describing the data the field contains, and why this is needed. E.g. what are ts_created and ts_touched and why are these needed?
- We also need, either here, or as a separate document, details of the source data for each table, how the table is initially populated, and how the table will be maintained going forwards (e.g. regular updates from other systems, manual changes using SQL or a web front end, etc.
- Could you also include views, either here or as a separate section?
Current list of tags
- What do the terms in brackets represent?
Write a next steps section
- Sequencing not done at WSI. E.g. will tracking system store ENA run accessions?
- Web front end?
- Tie up with SIMS system (maybe just a placeholder for now to say that this needs to happen

podpearson commented 5 years ago

@magnusmanske , could you talk to @sclaugoncalves to understand what the different library IDs available are, and ensure that the ones that get included in FITS are documented here?

Also, note that @alimanfoo has suggested (#6) using markdown for documentation, which I think is a good idea.

magnusmanske commented 5 years ago

I have ported the Google doc over to markdown, here.

I will incorporate some of the above suggestions. Note, however, that this is the database description document. It is not "FITS MVP", or FITS in general. It describes the database, not the philosophic rationale of having a file tracking system.

podpearson commented 5 years ago

I will incorporate some of the above suggestions. Note, however, that this is the database description document. It is not "FITS MVP", or FITS in general. It describes the database, not the philosophic rationale of having a file tracking system.

Yes, fair point. However, I think the comments above should be captured somewhere in the documentation. If you think this is not the right place, could you decide where is and document there?

magnusmanske commented 5 years ago

The document is now here

podpearson commented 5 years ago

In the following could you:

[ ] Separate out any checkbox from this list that you think is not relevant to this particular document either to a new issue, or else as a new checkbox in an existing issue.

Comments on this document (note some of these were previously in the comment dated Sep 28 and they haven't been addressed in the latest version):

"It does not describe FITS in total". But we need this overview somewhere, right? Could you

[x] Create a new issue to write this overview document
[x] Write a draft of this
[x] Give a link here to this overview
[x] Why the link between file and tag tables in the diagram? I would have thought these two tables could only be linked via file2tag, as is the case with sample and tag tables. Presumably there is some reason why storage has been special-cased here. Could you explain why this is?
[x] Could you write a description for each field, describing the data the field contains, and why this is needed. E.g. what are ts_created and ts_touched and why are these needed?
[x] Why the need for sample:name? How might this be used? Has it been populated in a consistent way to date, e.g. is it always sequencescape ID?
[x] We need to define the process for tag management somewhere (I'll leave it to you to think about whether that should be in this document or elsewhere), e.g. you say that tags "can" be self-documenting, but presumably this should be a requirement?
[x] We need some guidelines as to that sort of information the note field should contain.
[x] We also need to define the process for adding new tags. If for now this is asking you to add the new tag, that is fine, but let's state that clearly. Also, if you expect this might change in the future, also give details of whatever thoughts you have on this.
[x] "This data is collected from various sources". We need somewhere a more detailed description of how you have done this. Presumably this is https://github.com/malariagen/fits/blob/master/documentation/processes.md, right? If so, could you put a link to this document here?
[x] "The source and date of import is often given in the note field." Is this not always the case? If not, could you explain here why not?
[x] We need details of how to access the database, i.e. ip address, port, username, etc.
[x] This document only describes a subset of the tables, e.g. there is not mention of ag1000, ag20140122, fits_pf62_try1, etc. If these are temporary tables, will they removed before MVP V1? If not, could you briefly describe these other tables and what their purpose is?
[x] Could you also include views, either here or as a separate section?
[x] file_relation table. Do you envisage other uses of this? For example, in the future, if a file is derived by some computational process from one or more other files, would you expect that to be captured in this table?

magnusmanske commented 5 years ago

We already have "overview" and "mvp_v1" for, well, an overview. I have linked to those now. I don't see the point in Yet Another Document to duplicate that information.

I have added some field information to the SQL schema itself, where it does not appear relevant for the main document.

I would rather not add the database access details into a git doc/issue. That's just bad form.

magnusmanske commented 5 years ago

I'm not sure we need guidelines for the notes. They are, by definition, free-form. My guideline recommendation is "use common sense".

podpearson commented 5 years ago

We already have "overview" and "mvp_v1" for, well, an overview. I have linked to those now. I don't see the point in Yet Another Document to duplicate that information.

Sorry, I think I forgot there was already an overview document when writing this. The new links go to raw .md file rather than correctly rendered version - could this be fixed?

I have added some field information to the SQL schema itself, where it does not appear relevant for the main document.

OK, what might be useful is an example of how to access this information from the schema itself

I would rather not add the database access details into a git doc/issue. That's just bad form.

Fair point. How about including the details with the exception of the password and have a note saying "contact @magnusmanske for password" or similar?

magnusmanske commented 5 years ago

With this commit, I consider all points here addressed, and close the issue.

podpearson commented 5 years ago

@magnusmanske - I've just made a pull request with a few suggested small changes.

I also have a few follow on questions:

"this should be done automatically now, but may be missing from early imports". I find this worrying (see also comments on process doc). Could you retrospectively apply this to older imports. Presumably you still know all of the imports that were done, right?
file_relation table. Is there a convention here about whether the BAM or the CRAM is considered the "parent"? I'm wondering this in particular because I'm wondering which will get select in your code for creating a manifest when both are available.
Why the need for sample.name? How might this be used? Has it been populated in a consistent way to date, e.g. does the sequencescape number represent anything specific?

Please address the above by making pull requests, and outlining answers to each of the above (including question number) either in the pull request comments, or else as comments in this issue.

This issue should be left open until we have sign-off, i.e. agreement at a production meeting that this is good to go.

malariagen / fits

Rewrite database description document #9