Open magnusmanske opened 6 years ago
Comments from email mentioned above are as follows:
I think this is a good start. I think what this particularly needs now is an introduction explaining why this is needed, what it is replacing, etc., and also pointers to what comes next. The descriptions of the tables and fields need expanding a bit - I think you should be able to understand every single field in the database using this document. I think we also need, either in this document, or as a separate document, details of the source data for this, how it is populated, etc.
Specifically I would suggest:
@magnusmanske , could you talk to @sclaugoncalves to understand what the different library IDs available are, and ensure that the ones that get included in FITS are documented here?
Also, note that @alimanfoo has suggested (#6) using markdown for documentation, which I think is a good idea.
I have ported the Google doc over to markdown, here.
I will incorporate some of the above suggestions. Note, however, that this is the database description document. It is not "FITS MVP", or FITS in general. It describes the database, not the philosophic rationale of having a file tracking system.
I will incorporate some of the above suggestions. Note, however, that this is the database description document. It is not "FITS MVP", or FITS in general. It describes the database, not the philosophic rationale of having a file tracking system.
Yes, fair point. However, I think the comments above should be captured somewhere in the documentation. If you think this is not the right place, could you decide where is and document there?
The document is now here
In the following could you:
Comments on this document (note some of these were previously in the comment dated Sep 28 and they haven't been addressed in the latest version):
"It does not describe FITS in total". But we need this overview somewhere, right? Could you
[x] Create a new issue to write this overview document
[x] Write a draft of this
[x] Give a link here to this overview
[x] Why the link between file and tag tables in the diagram? I would have thought these two tables could only be linked via file2tag, as is the case with sample and tag tables. Presumably there is some reason why storage has been special-cased here. Could you explain why this is?
[x] Could you write a description for each field, describing the data the field contains, and why this is needed. E.g. what are ts_created and ts_touched and why are these needed?
[x] Why the need for sample:name? How might this be used? Has it been populated in a consistent way to date, e.g. is it always sequencescape ID?
[x] We need to define the process for tag management somewhere (I'll leave it to you to think about whether that should be in this document or elsewhere), e.g. you say that tags "can" be self-documenting, but presumably this should be a requirement?
[x] We need some guidelines as to that sort of information the note field should contain.
[x] We also need to define the process for adding new tags. If for now this is asking you to add the new tag, that is fine, but let's state that clearly. Also, if you expect this might change in the future, also give details of whatever thoughts you have on this.
[x] "This data is collected from various sources". We need somewhere a more detailed description of how you have done this. Presumably this is https://github.com/malariagen/fits/blob/master/documentation/processes.md, right? If so, could you put a link to this document here?
[x] "The source and date of import is often given in the note field." Is this not always the case? If not, could you explain here why not?
[x] We need details of how to access the database, i.e. ip address, port, username, etc.
[x] This document only describes a subset of the tables, e.g. there is not mention of ag1000, ag20140122, fits_pf62_try1, etc. If these are temporary tables, will they removed before MVP V1? If not, could you briefly describe these other tables and what their purpose is?
[x] Could you also include views, either here or as a separate section?
[x] file_relation table. Do you envisage other uses of this? For example, in the future, if a file is derived by some computational process from one or more other files, would you expect that to be captured in this table?
We already have "overview" and "mvp_v1" for, well, an overview. I have linked to those now. I don't see the point in Yet Another Document to duplicate that information.
I have added some field information to the SQL schema itself, where it does not appear relevant for the main document.
I would rather not add the database access details into a git doc/issue. That's just bad form.
I'm not sure we need guidelines for the notes. They are, by definition, free-form. My guideline recommendation is "use common sense".
We already have "overview" and "mvp_v1" for, well, an overview. I have linked to those now. I don't see the point in Yet Another Document to duplicate that information.
Sorry, I think I forgot there was already an overview document when writing this. The new links go to raw .md file rather than correctly rendered version - could this be fixed?
I have added some field information to the SQL schema itself, where it does not appear relevant for the main document.
OK, what might be useful is an example of how to access this information from the schema itself
I would rather not add the database access details into a git doc/issue. That's just bad form.
Fair point. How about including the details with the exception of the password and have a note saying "contact @magnusmanske for password" or similar?
With this commit, I consider all points here addressed, and close the issue.
@magnusmanske - I've just made a pull request with a few suggested small changes.
I also have a few follow on questions:
"this should be done automatically now, but may be missing from early imports". I find this worrying (see also comments on process doc). Could you retrospectively apply this to older imports. Presumably you still know all of the imports that were done, right?
file_relation
table. Is there a convention here about whether the BAM or the CRAM is considered the "parent"? I'm wondering this in particular because I'm wondering which will get select in your code for creating a manifest when both are available.
Why the need for sample.name? How might this be used? Has it been populated in a consistent way to date, e.g. does the sequencescape number represent anything specific?
Please address the above by making pull requests, and outlining answers to each of the above (including question number) either in the pull request comments, or else as comments in this issue.
This issue should be left open until we have sign-off, i.e. agreement at a production meeting that this is good to go.
based on previous comments from Richard in email 28/06/2018 13:59