Closed ValWood closed 4 years ago
@mah11 is any of this useful for the spreadsheet? If not, this can close; we can open a ticket later if we decide to convert to a form.
More thoughts
~1. I also wondered if some of the labels are optimal. When you see them in the browser track selection, I'm not sure that it is clear what they refer to.~ ~The column labelled "alleles", for example, isn't "alleles"; it's genotype class (or something). Changing this label may make it clearer why there is an "allele" and a "mutant" column.~
~2. In the spreadsheet, why are there 2 track label rows? Are different ones used in different places? Do we need 2 rows??~
~3. Ideally the labels would match the headings we use in the browser. These seem to be different:~
~Where are these configured?~
Below will resolve the above
Where are these configured?
It's "renameFacets" here: https://github.com/pombase/pombase-config/blob/master/website/trackListTemplate.json
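For illustration only, here is a minimal Python sketch of how a `renameFacets` mapping relates metadata column names to the labels shown in the track selector. `renameFacets` is an option of the JBrowse faceted track selector; the two entries below are taken from labels mentioned elsewhere in this thread and are assumptions, not the actual contents of trackListTemplate.json.

```python
import json

# Hypothetical mapping: metadata column name -> label shown in the track selector.
# These entries are illustrative, not copied from trackListTemplate.json.
rename_facets = {
    "label": "Track Label",
    "growth_phase_or_response": "Growth phase or response",
}


def display_label(column: str, renames: dict[str, str]) -> str:
    """Return the label shown for a column, falling back to the raw column name."""
    return renames.get(column, column)


# A config fragment of the general shape used for a faceted track selector.
track_selector = {"type": "Faceted", "renameFacets": rename_facets}
print(json.dumps({"trackSelector": track_selector}, indent=2))
```

Any column without an entry in the mapping would simply be shown under its raw name, which may be why some headings in the browser look different from the documentation.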
Why does the file name column say "short description" do we know?
I don't understand that. Where is the file name column?
In the spreadsheet, why are there 2 track label rows? Are different ones used in different places? Do we need 2 rows??
Can you give an example? I can't see any like that. There is one row in the file per track.
I don't understand that. Where is the file name column? Can you give an example? I can't see any like that. There is one row in the file per track.
This is in the submission spreadsheet. We need to settle on the labels (or why we have different ones should be clearer). There might be a reason not to consolidate but it is unclear to me.
In the spreadsheet [template for user submissions] why are there 2 ~track label~ column header rows ... Do we need 2 rows??
The difference is that row 3 is intended as a user-friendly version of the column header, whereas row 4 is the actual text from the metadata configuration file (pombase_jbrowse_track_metadata.csv). I can remove row 4 if you prefer, but I included it for our convenience because the config file doesn't have exactly the same columns in exactly the same order. It has some columns for details that we won't gather from users (e.g. "display_in_jbrowse"). So if it were up to me I would keep both rows 3 and 4 in the user template to save us looking up which user-friendly text corresponds to which config file header, but I won't die on that hill.
This is exactly the sort of feedback I was after when I sent the DRAFT submission template round, and which should have been resolved before the template went out to any users.
The column labelled "alleles" for example isn't "alleles" it's genotype class (or something). Changing this label may make it clearer why there is an "allele", and a "mutant" column
I used "WT or mutant" as the user-friendly header for this column in the submission template draft. I can change it to match whatever else you decide to use as the header in the actual browser track selector.
Happy to keep both rows, but the labels here don't seem to match what is in the display? I thought they were what you describe above, but then got confused:
I see display labels:

row3 | row4 | display |
---|---|---|
Data Type | Data_type | Data Type |
Track label | label | Track Label (display matches row 3) |
Assayed Protein | Assayed_gene | localized gene product (matches neither row) |
WT or mutant | Alleles | Alleles (display matches row 4) |
Mating type | mating_type | |
Growth phase or response | growth_phase_or_response | Growth phase or response |
Mutant_alleles | mutant | (display matches row 4) |
what would be a sensible label for the current WT or mutant field? allele type ? (that is what we use in Canto for this classifier)
At present, row 3 in the user submission template matches the "Contents" column on the documentation web page, and row 4 matches the headers in the metadata config .tsv file.
I have not touched the text used in the browser track selection display, but it would make sense to make template row 3, the documentation table, and the track selection display mutually consistent. I do not have strong preferences for what the consistently used text should be for any column. Once you decide what you prefer to use for each bit of content, it will be clear which bits to change where (and I'll be happy to implement).
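A rough sketch of how that consistency could be checked automatically, using the row 3 / row 4 / display values from the comparison table earlier in this thread. The function name and the case-insensitive comparison are assumptions made for illustration; nothing like this exists in the repositories as far as this thread shows.

```python
# (row 3 template header, row 4 config header, browser display label),
# copied from the comparison table earlier in this thread.
LABELS = [
    ("Data Type", "Data_type", "Data Type"),
    ("Track label", "label", "Track Label"),
    ("Assayed Protein", "Assayed_gene", "localized gene product"),
    ("WT or mutant", "Alleles", "Alleles"),
    ("Growth phase or response", "growth_phase_or_response", "Growth phase or response"),
]


def inconsistent(labels):
    """Return config headers whose template label and display label differ.

    The comparison ignores case and surrounding whitespace (an assumption;
    a stricter check may be preferable).
    """
    return [
        config
        for template, config, display in labels
        if template.strip().casefold() != display.strip().casefold()
    ]


print(inconsistent(LABELS))
```

With the values above, only the assayed protein and WT-or-mutant columns are reported, which matches the two columns discussed in this thread.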
what would be a sensible label for the current WT or mutant field? allele type ? (that is what we use in Canto for this classifier)
Specifically for this column, I don't think "allele type" works, unless I'm completely misunderstanding what the column is meant to contain. I think it's to say whether the strain is wild type (except for markers & other background alleles) or has one or more mutations in genes of interest. It's not about whether a particular allele is wild type or mutant (that's why it's not "allele type").
OK, these are my suggestions, either to normalise differences, or where other changes might be required. For all excluded rows the website instructions and the track labelling are consistent. It did not occur to me that these might be different from each other hence my lack of checking.
Column | Website (docs) | Track (JBrowse) | row 3 | row 4 | Suggested website AND track header | Notes |
---|---|---|---|---|---|---|
2 | Track Label | Track label | Track Label | label | Track description? | |
3 | Assayed Protein | Localized gene product | Assayed protein | assayed_gene_product | Assayed gene product | |
4 | Experimental background alleles | Background | Experimental background alleles | background | Experimental background alleles | revert to "background" in browser if this makes columns wider, but I think the words will 'stack' |
5 | WT or mutant | Alleles | WT or mutant | Alleles | WT or mutant | |
6 | Mutant alleles | Mutant(s) | | | Mutant alleles | |
7 | Mating Type | Mating-type | | | Isolate and mating type | I put this because h90 is recorded here so I just used the field for 972 h- |
11 | Strand | n/a | | | Strand | |
16 | Study ID | Study ID | | | Study ID | Currently mandatory, but some datasets may not have a study ID or database (Lanterman curated for example or intron branch points) |
17 | Database | Database | | | Database | Currently mandatory, but some datasets may not have a study ID or database (Dutrow curated for example or intron branch points) |
19 | Data file type | n/a | | | Data file type | |
20 | File name | n/a | | | File name | |
Also, we should consider aligning the order of the labelling in the documentation with the track labels, so that they are easier to check. Would it be OK to switch the docs and spreadsheet to match the tracks, or is there a reason not to do so?
OK, these are my suggestions.
We don't need this big complicated table. All I need is what you want displayed to users for each column.
Would it be OK to switch the docs and spreadsheet to match the tracks, or is there a reason not to do so?
I don't know. I made the submission template match the order on the documentation page, but I never have known whether there were specific reasons for any of the column ordering.
I put the columns side by side in a spreadsheet to compare them, and then cut and pasted in the spreadsheet, and I wanted a record of which things are changed, so I could quickly see the logic of the change if you decided to criticise it. It was the most efficient way for me to do it.
Fine. But the most efficient thing for me to use is a simple list of what headings you want users to see.
That is in "Suggested website AND track header"? but the notes need to be taken into account to0.
I don't know. I made the submission template match the order on the documentation page, but I never have known whether there were specific reasons for any of the column ordering.
I thought you did the website doc ordering , but in that case I don't see why we can't switch to match the track labels.
but the notes need to be taken into account to.
to what? (looks like the sentence wasn't complete)
I thought you did the website doc ordering
Antonia did that page.
You know very well that is a typo.
I knew nothing of the sort at the time I read the comment.
mating type (column 7)
I put this because h90 is recorded here so I just used the field for 972 h-
h90 is a mating type. The config file currently has only mating type designations or null.
For the 968/972/975 part, I am tempted to re-think this column and the background (column 4) ... read on.
Experimental background alleles
In light of column 7 above, I wonder if we should revisit what we intend to show using the "background" column.
First, does mating type need to be separated out into its own column? (@kimrutherford @Antonialock) If so, then we should keep it, and only put mating type designations in it.
If not, and we want to include the parental strain designation in that column, should we use a header more like what the PHAFs have, "Parental strain" or "Background strain" (or "strain background")?
Finally, for the background alleles, could we use "Background genotype description" as in PHAF? (I think including "experimental" makes it needlessly wordy.)
columns 16 & 17
Currently mandatory, but some datasets may not have a study ID or database
Again, I copied what Antonia put on the documentation page. I'd have thought these issues would have been resolved already, when that page went up. Hey ho.
Anyway, the config file has "not available" entered in the study ID column for most of the rows that don't have IDs. I wouldn't mind insisting that the column be filled out for every row, even those where we have to use "not available", but I don't have a strong preference. It doesn't matter much for the template because there's nothing technologically enforcing mandatory-ness. It's just a free-text field.
Oh, and Lanterman has a study ID.
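Although nothing in the template enforces this today, the "mandatory, but 'not available' is acceptable" rule for study IDs could be sketched as a simple check on each submitted row. The column list and function name are hypothetical, since which columns should be mandatory is still under discussion here.

```python
# Hypothetical list of mandatory template columns; the real list is not settled.
MANDATORY = ["Study ID", "Database"]


def missing_mandatory(row, mandatory=MANDATORY):
    """Return the mandatory columns that are absent or blank in a submitted row.

    Any non-blank free text counts as filled in, including "not available".
    """
    return [col for col in mandatory if not (row.get(col) or "").strip()]


row = {"Study ID": "not available", "Database": "  "}
print(missing_mandatory(row))
```

Here "not available" passes, while a blank or whitespace-only cell is flagged.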
mating type (column 7)
I put this because h90 is recorded here so I just used the field for 972 h-
h90 is a mating type. The config file currently has only mating type designations or null.
For the 968/972/975 part, I am tempted to re-think this column and the background (column 4) ... read on.
Experimental background alleles
In light of column 7 above, I wonder if we should revisit what we intend to show using the "background" column.
First, does mating type need to be separated out into its own column? (@kimrutherford @Antonialock) If so, then we should keep it, and only put mating type designations in it.
No, it can go with background.
If not, and we want to include the parental strain designation in that column, should we use a header more like what the PHAFs have, "Parental strain" or "Background strain" (or "strain background")?
Finally, for the background alleles, could we use "Background genotype description" as in PHAF? (I think including "experimental" makes it needlessly wordy.)
Fine, it was a placeholder until I could think of something better
Oh, and Lanterman has a study ID.
That's OK, the study ID already in the table. This should have said Dutrow, but it's arbitrary really, some datasets don't have study IDs or databases.
The point is that the table and schema were devised before many datasets were hosted and before the types of data were envisaged, so we need to expect changes to the protocols; they are a work in progress. Antonia did a great job of this: it's orders of magnitude more formalised than any other database browser dataset hosting. The pages haven't been advertised or used much by the community so far (as is evident from the lack of submissions), so it isn't surprising that nobody noticed there are minor inconsistencies.
h90 is also the name of the standard homothallic strain, so it's a bit ambiguous.
I like Background genotype descriptions and Strain background
I don't think there is any reason to keep "mating type" in its own column. These are data types devised by us, extended from the Ensembl track labels. They aren't a JBrowse requirement.
Probably we wanted to keep the column widths manageable, but provided that the track labels don't have underscores, they will scale to the longest word.
That's OK, the study ID already in the table. This should have said Dutrow, but it's arbitrary really, some datasets don't have study IDs or databases.
Actually, correction. There is a Lantermann transcriptome dataset (that is the one with a study ID), but the curated dataset is also from Lantermann. So, although there is a data source ID for one dataset, there is an additional dataset (the one we really need, with manually curated data) which will not have an associated study ID. It is curated from multiple sources.
I don't think there is any reason to keep "mating type" in its own column.
Fair 'nuff. But what do you want to capture for backgrounds, and does that (whatever it is) actually need to be split over two columns? If there's value in separating them, I can run with phaf-esque "Strain background" and "Background genotype description". But it's worth making sure it's useful to have two columns at all.
Probably we wanted to keep the column widths manageable, but provided that the track labels don't have underscores, they will scale to the longest word
I'm not really considering column width in any displays. I'm aiming to omit words that don't add substantive content.
One more question that's simple to ask: in light of our current understanding of incoming data, which columns should be mandatory in the submission template?
Fair 'nuff. But what do you want to capture for backgrounds, and does that (whatever it is) actually need to be split over two columns? If there's value in separating them, I can run with phaf-esque "Strain background" and "Background genotype description". But it's worth making sure it's useful to have two columns at all.
I don't understand the question. I don't really have a good idea precisely what would be captured, but I envisage it would be similar to backgrounds captured during phenotype annotation. I haven't done enough data set hosting to judge. Basically, for background genotype description we want to capture backgrounds that are relevant to the interpretation of the experiment. Background is displayed only so the person browsing the tracks will know what they are looking at. It isn't used often (I see things like the pat1 mutant to initiate meiosis). I guess people would report background for a track in the same circumstances that they report for a phenotype annotation.
I don't see any need to have separate columns for strain background, genotype background or mating type. The columns can be merged unless anyone can think of a reason why not to. It would be better than displaying lots of usually empty columns.
One more question that's simple to ask: in light of our current understanding of incoming data, which columns should be mandatory in the submission template?
It looked OK as is, except for the changes to column 16 and 17 reported above.
I don't understand the question.
It boils down to: What do we need to capture about strains for browser tracks?
A bit more specifically, do browser tracks need the same level of detail as phenotype annotations? (I kind of suspect not.) I'm happy to shove all background details into one "Strain background" column if @kimrutherford thinks that'll be OK (we can have an empty column in the config file temporarily, or even forever, as long as the system can cope with what we put in the background column).
which columns should be mandatory in the submission template?
It looked OK as is, except for the changes to column 16 and 17 reported above.
Do you mean that these columns should be mandatory?
If so, that'll be fine to start with; presumably you'll want to make PubMed ID optional if & when we take pre-publication data.
That sounds fine.
I've done the bulk of the work for this now. The changes in the metadata file and related files are fairly big and sufficiently scary that I've done them as pull requests, complete with requests for Kim to review them before either of us merges them in.
https://github.com/pombase/pombase-chado/pull/785 https://github.com/pombase/pombase-config/pull/32
I have gone out on the limb of updating the website documentation to match the changes, and added & linked up the metadata submission spreadsheet template. Wheee!
This can close when the pull requests are merged. After that, new tickets for further changes.
Kim merged and tested yesterday's changes, and says it all looks fine.
https://github.com/pombase/website/blob/master/src/docs/documentation/data-submission-form-for-HTP-sequence-linked-data.md
Add a link to the current allowed list.
Is it possible to describe the track label more simply? Could the form include the individual components as rows?
Here the mandatory field is used for a "comment". Should we have a comment field? (It might be useful to add a note to many of the fields.)
Experimental background has mandatory Yes?
Isn't very clear. Move the 'comment' to a comment field and add a "mutant" example.
Now I don't understand the difference between 5 and 6!
Do they need to know how to specify conditions?
Might need an explanation, maybe a list of what is currently allowed? 16, 17 and 18 seem to be in a bit of a funny order (is database here a prefix?).
We should link to a list of all currently available datatypes.