suggestions for the browser data form

ValWood commented 4 years ago

https://github.com/pombase/website/blob/master/src/docs/documentation/data-submission-form-for-HTP-sequence-linked-data.md

Add a link to the current allowed list.
Is it possible to describe the track label more simply? Could the form include the individual components as rows?
Here the mandatory field is used for a "comment". Should we have a comment field? (It might be useful to be able to add a note to many of the fields.)
Experimental background has mandatory Yes?
This isn't very clear. Move the 'comment' to a comment field and add a mutant example.
Now I don't understand the difference between 5 and 6!
Do they need to know how to specify conditions?
This might need an explanation, maybe a list of what is currently allowed? 16, 17 and 18 seem to be in a bit of a funny order (is database here a prefix?).
We should link to a list of all currently available datatypes.

ValWood commented 4 years ago

@mah11 is any of this useful for the spreadsheet? If not, this can close; we can open a ticket later if we decide to convert to a form.

ValWood commented 4 years ago

More thoughts

~1. I also wondered if some of the labels are optimal. When you see them in the browser track selection, I'm not sure that it is clear what they refer to.~
~The column labelled "alleles" for example isn't "alleles", it's genotype class (or something). Changing this label may make it clearer why there is an "allele" and a "mutant" column.~
~2. In the spreadsheet, why are there 2 track label rows? Are different ones used in different places? Do we need 2 rows?~
~3. Ideally the labels would match the headings we use in the browser. These seem to be different:~

~Where are these configured?~

Below will resolve the above

Why does the file name column say "short description"? Do we know?

kimrutherford commented 4 years ago

Where are these configured?

It's "renameFacets" here: https://github.com/pombase/pombase-config/blob/master/website/trackListTemplate.json
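For orientation, here is a minimal sketch of how a `renameFacets` mapping could be read. The JSON is an illustrative fragment, not the contents of trackListTemplate.json: the surrounding `trackSelector` structure, the column names and the display labels are all assumptions borrowed from the labels discussed in this issue.

```python
import json

# Illustrative fragment only (assumption): the real trackListTemplate.json may differ.
# Keys are metadata column names; values are the labels shown in the track selector.
EXAMPLE_TEMPLATE = """
{
  "trackSelector": {
    "type": "Faceted",
    "renameFacets": {
      "assayed_gene_product": "Assayed gene product",
      "background": "Background",
      "mating_type": "Mating type"
    }
  }
}
"""


def facet_labels(template_text):
    """Return the column-name -> display-label mapping, or {} if none is set."""
    config = json.loads(template_text)
    return config.get("trackSelector", {}).get("renameFacets", {})


labels = facet_labels(EXAMPLE_TEMPLATE)
for column, label in sorted(labels.items()):
    print(f"{column} -> {label}")
```

Listing the mapping like this makes it easy to compare the configured display labels with the submission template and the documentation page.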

Why does the file name column say "short description" do we know?

I don't understand that. Where is the file name column?

kimrutherford commented 4 years ago

In the spreadsheet why are there 2 track label rows, are different ones used in different places? Do we need 2 rows??

Can you give an example? I can't see any like that. There is one row in the file per track.

ValWood commented 4 years ago

I don't understand that. Where is the file name column? Can you give an example? I can't see any like that. There is one row in the file per track.

This is in the submission spreadsheet. We need to settle on the labels (or why we have different ones should be clearer). There might be a reason not to consolidate but it is unclear to me.

mah11 commented 4 years ago

In the spreadsheet [template for user submissions] why are there 2 ~track label~ column header rows ... Do we need 2 rows??

The difference is that row 3 is intended as a user-friendly version of the column header, whereas row 4 is the actual text from the metadata configuration file (pombase_jbrowse_track_metadata.csv). I can remove row 4 if you prefer, but I included it for our convenience because the config file doesn't have exactly the same columns in exactly the same order. It has some columns for details that we won't gather from users (e.g. "display_in_jbrowse"). So if it were up to me I would keep both rows 3 and 4 in the user template to save us looking up which user-friendly text corresponds to which config file header, but I won't die on that hill.

This is exactly the sort of feedback I was after when I sent the DRAFT submission template round, and which should have been resolved before the template went out to any users.

mah11 commented 4 years ago

The column labelled "alleles" for example isn't "alleles" it's genotype class (or something). Changing this label may make it clearer why there is an "allele", and a "mutant" column

I used "WT or mutant" as the user-friendly header for this column in the submission template draft. I can change it to match whatever else you decide to use as the header in the actual browser track selector.

ValWood commented 4 years ago

Happy to keep both rows, but the labels here don't seem to match what is in the display? I thought they were what you describe above but then got confused:

I see these display labels:

| row 3 | row 4 | display |
| --- | --- | --- |
| Data Type | Data_type | Data Type |
| Track label | label | Track Label (display matches row 3) |
| Assayed Protein | Assayed_gene | localized gene product (matches neither row) |
| WT or mutant | Alleles | Alleles (display matches row 4) |
| Mating type | mating_type | |
| Growth phase or response | growth_phase_or_response | Growth phase or response |
| Mutant_alleles | mutant | (display matches row 4) |

What would be a sensible label for the current WT or mutant field? Allele type? (That is what we use in Canto for this classifier.)

mah11 commented 4 years ago

At present, row 3 in the user submission template matches the "Contents" column on the documentation web page, and row 4 matches the headers in the metadata config .tsv file.

I have not touched the text used in the browser track selection display, but it would make sense to make template row 3, the documentation table, and the track selection display mutually consistent. I do not have strong preferences for what the consistently used text should be for any column. Once you decide what you prefer to use for each bit of content, it will be clear which bits to change where (and I'll be happy to implement).
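A rough sketch of how the three sets of labels could be checked for consistency. The label triples are the ones ValWood listed earlier in this thread; the comparison rule (ignore case, treat underscores as spaces) is an assumption for illustration, not something any PomBase tool is known to do.

```python
# Labels as listed earlier in this thread:
# (template row 3, config-file row 4, browser display).
LABELS = [
    ("Data Type", "Data_type", "Data Type"),
    ("Track label", "label", "Track Label"),
    ("Assayed Protein", "Assayed_gene", "localized gene product"),
    ("WT or mutant", "Alleles", "Alleles"),
    ("Growth phase or response", "growth_phase_or_response", "Growth phase or response"),
]


def normalise(label):
    """Compare labels case-insensitively, treating underscores as spaces (assumed rule)."""
    return label.replace("_", " ").strip().lower()


def mismatches(rows):
    """Return (row 3, display) pairs where the template label differs from the browser label."""
    return [
        (row3, display)
        for row3, _row4, display in rows
        if normalise(row3) != normalise(display)
    ]


for row3, display in mismatches(LABELS):
    print(f"template says {row3!r}, browser says {display!r}")
```

With the labels above, this flags "Assayed Protein" and "WT or mutant" as the two columns where the template and the browser disagree.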

what would be a sensible label for the current WT or mutant field? allele type ? (that is what we use in Canto for this classifier)

Specifically for this column, I don't think "allele type" works, unless I'm completely misunderstanding what the column is meant to contain. I think it's to say whether the strain is wild type (except for markers & other background alleles) or has one or more mutations in genes of interest. It's not about whether a particular allele is wild type or mutant (that's why it's not "allele type").

ValWood commented 4 years ago

OK, these are my suggestions, either to normalise differences, or where other changes might be required. For all excluded rows the website instructions and the track labelling are consistent. It did not occur to me that these might be different from each other hence my lack of checking.

| Column | Website (docs) | Track (JBrowse) | row 3 | row 4 | Suggested website AND track header | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| 2 | Track Label | Track label | Track Label | label | Track description? | |
| 3 | Assayed Protein | Localized gene product | Assayed protein | assayed_gene_product | Assayed gene product | |
| 4 | Experimental background alleles | Background | Experimental background alleles | background | Experimental background alleles | revert to "background" in browser if this makes columns wider, but I think the words will 'stack' |
| 5 | WT or mutant | Alleles | WT or mutant | Alleles | WT or mutant | |
| 6 | Mutant alleles | Mutant(s) | | | Mutant alleles | |
| 7 | Mating Type | Mating-type | | | Isolate and mating type | I put this because h90 is recorded here so I just used the field for 972 h- |
| 11 | Strand | n/a | | | Strand | |
| 16 | Study ID | Study ID | | | Study ID | Currently mandatory, but some datasets may not have a study ID or database (Lanterman curated for example or intron branch points) |
| 17 | Database | Database | | | Database | Currently mandatory, but some datasets may not have a study ID or database (Dutrow curated for example or intron branch points) |
| 19 | Data file type | n/a | | | Data file type | |
| 20 | File name | n/a | | | File name | |

ValWood commented 4 years ago

Also, we should consider aligning the order of the labelling in the documentation and the track labels so that they are easier to check. Would it be OK to switch the docs and spreadsheet to match the tracks, or is there a reason not to do so?

mah11 commented 4 years ago

OK, these are my suggestions.

We don't need this big complicated table. All I need is what you want displayed to users for each column.

mah11 commented 4 years ago

Would it be OK to switch the docs and spreadsheet to match the tracks, or is there a reason not to do so?

I don't know. I made the submission template match the order on the documentation page, but I never have known whether there were specific reasons for any of the column ordering.

ValWood commented 4 years ago

I put the columns side by side in a spreadsheet to compare them, and then cut and pasted from the spreadsheet. I wanted a record of which things are changed, so I could quickly see the logic of the change if you decided to criticise it. It was the most efficient way for me to do it.

mah11 commented 4 years ago

Fine. But the most efficient thing for me to use is a simple list of what headings you want users to see.

ValWood commented 4 years ago

That is in "Suggested website AND track header"? but the notes need to be taken into account to0.

ValWood commented 4 years ago

I don't know. I made the submission template match the order on the documentation page, but I never have known whether there were specific reasons for any of the column ordering.

I thought you did the website doc ordering, but in that case I don't see why we can't switch to match the track labels.

mah11 commented 4 years ago

but the notes need to be taken into account to.

to what? (looks like the sentence wasn't complete)

mah11 commented 4 years ago

I thought you did the website doc ordering

Antonia did that page.

ValWood commented 4 years ago

You know very well that is a typo.

mah11 commented 4 years ago

I knew nothing of the sort at the time I read the comment.

mah11 commented 4 years ago

mating type (column 7)

I put this because h90 is recorded here so I just used the field for 972 h-

h90 is a mating type. The config file currently has only mating type designations or null.

For the 968/972/975 part, I am tempted to re-think this column and the background (column 4) ... read on.

Experimental background alleles

In light of column 7 above, I wonder if we should revisit what we intend to show using the "background" column.

First, does mating type need to be separated out into its own column? (@kimrutherford @Antonialock) If so, then we should keep it, and only put mating type designations in it.

If not, and we want to include the parental strain designation in that column, should we use a header more like what the PHAFs have, "Parental strain" or "Background strain" (or "strain background")?

Finally, for the background alleles, could we use "Background genotype description" as in PHAF? (I think including "experimental" makes it needlessly wordy.)

mah11 commented 4 years ago

columns 16 & 17

Currently mandatory, but some datasets may not have a study ID or database

Again, I copied what Antonia put on the documentation page. I'd have thought these issues would have been resolved already, when that page went up. Hey ho.

Anyway, the config file has "not available" entered in the study ID column for most of the rows that don't have IDs. I wouldn't mind insisting that the column be filled out for every row, even those where we have to use "not available", but I don't have a strong preference. It doesn't matter much for the template because there's nothing technologically enforcing mandatory-ness. It's just a free-text field.

Oh, and Lanterman has a study ID.

Antonialock commented 4 years ago

mating type (column 7)

I put this because h90 is recorded here so I just used the field for 972 h-

h90 is a mating type. The config file currently has only mating type designations or null.

For the 968/972/975 part, I am tempted to re-think this column and the background (column 4) ... read on.

Experimental background alleles

In light of column 7 above, I wonder if we should revisit what we intend to show using the "background" column.

First, does mating type need to be separated out into its own column? (@kimrutherford @Antonialock) If so, then we should keep it, and only put mating type designations in it.

No, it can go with background.

If not, and we want to include the parental strain designation in that column, should we use a header more like what the PHAFs have, "Parental strain" or "Background strain" (or "strain background")?

Finally, for the background alleles, could we use "Background genotype description" as in PHAF? (I think including "experimental" makes it needlessly wordy.)

Fine, it was a placeholder until I could think of something better

ValWood commented 4 years ago

Oh, and Lanterman has a study ID.

That's OK, the study ID is already in the table. This should have said Dutrow, but it's arbitrary really; some datasets don't have study IDs or databases.

The point is that the table and schema were devised before many datasets were hosted and before the types of data were envisaged, so we need to expect changes to the protocols; they are a work in progress. Antonia did a great job of this: it's orders of magnitude more formalised than any other database browser dataset hosting. The pages haven't been advertised or used by the community much so far (as is evident from the lack of submissions), so it isn't surprising that nobody has noticed there are minor inconsistencies.

ValWood commented 4 years ago

h90 is also the name of the standard homothallic strain, so it's a bit ambiguous.

I like Background genotype descriptions and Strain background

I don't think there is any reason to keep "mating type" in its own column. These are data types devised by us, extended from the Ensembl track labels. They aren't a JBrowse requirement.

Probably we wanted to keep the column widths manageable, but provided that the track labels don't have underscores, they will scale to the longest word.

ValWood commented 4 years ago

That's OK, the study ID is already in the table. This should have said Dutrow, but it's arbitrary really; some datasets don't have study IDs or databases.

Actually, a correction. There is a Lantermann transcriptome dataset (that is the one with a study ID), but the curated dataset is also from Lantermann. So although there is a data source ID for one dataset, there is an additional dataset (the one we really need, with manually curated data) which will not have an associated study ID. It is curated from multiple sources.

mah11 commented 4 years ago

I don't think there is any reason to keep "mating type" in its own column.

Fair 'nuff. But what do you want to capture for backgrounds, and does that (whatever it is) actually need to be split over two columns? If there's value in separating them, I can run with phaf-esque "Strain background" and "Background genotype description". But it's worth making sure it's useful to have two columns at all.

Probably we wanted to keep the column widths manageable, but provided that the track labels don't have underscores, they will scale to the longest word.

I'm not really considering column width in any displays. I'm aiming to omit words that don't add substantive content.

mah11 commented 4 years ago

One more question that's simple to ask: in light of our current understanding of incoming data, which columns should be mandatory in the submission template?

ValWood commented 4 years ago

Fair 'nuff. But what do you want to capture for backgrounds, and does that (whatever it is) actually need to be split over two columns? If there's value in separating them, I can run with phaf-esque "Strain background" and "Background genotype description". But it's worth making sure it's useful to have two columns at all.

I don't understand the question. I don't really have a good idea precisely what would be captured, but I envisage it would be similar to backgrounds captured during phenotype annotation. I haven't done enough data set hosting to judge. Basically, for background genotype description we want to capture backgrounds that are relevant to the interpretation of the experiment. Background is displayed only so the person browsing the tracks will know what they are looking at. It isn't used often (I see things like the pat1 mutant to initiate meiosis). I guess people would report background for a track in the same circumstances that they report for a phenotype annotation.

I don't see any need to have separate columns for strain background, genotype background or mating type. The columns can be merged unless anyone can think of a reason why not to. It would be better than displaying lots of usually empty columns.

ValWood commented 4 years ago

One more question that's simple to ask: in light of our current understanding of incoming data, which columns should be mandatory in the submission template?

It looked OK as is, except for the changes to column 16 and 17 reported above.

mah11 commented 4 years ago

I don't understand the question.

It boils down to: What do we need to capture about strains for browser tracks?

A bit more specifically, do browser tracks need the same level of detail as phenotype annotations? (I kind of suspect not.) I'm happy to shove all background details into one "Strain background" column if @kimrutherford thinks that'll be OK (we can have an empty column in the config file temporarily, or even forever, as long as the system can cope with what we put in the background column).

which columns should be mandatory in the submission template?

It looked OK as is, except for the changes to column 16 and 17 reported above.

Do you mean that these columns should be mandatory?

Data type (A), Track label (B), Strain background (D), WT or mutant (E), Growth phase or response (J), Assay type (L), First author (M), Publication year (O), PubMed ID (P), Data file type (S), File name (T)?

If so, that'll be fine to start with; presumably you'll want to make PubMed ID optional if & when we take pre-publication data.
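Since nothing in the spreadsheet enforces mandatory-ness, a check like the following could be run on incoming rows. This is a hypothetical sketch: the column names are the ones proposed just above, the example row values are placeholders, and no such validator is known to exist in the PomBase code.

```python
# Mandatory columns as proposed in this comment.
MANDATORY = [
    "Data type", "Track label", "Strain background", "WT or mutant",
    "Growth phase or response", "Assay type", "First author",
    "Publication year", "PubMed ID", "Data file type", "File name",
]


def missing_mandatory(row, mandatory=MANDATORY):
    """Return the mandatory column names that are absent or blank in a row."""
    return [name for name in mandatory if not str(row.get(name, "")).strip()]


# Placeholder row (hypothetical values): every mandatory column except PubMed ID.
example_row = {name: "example" for name in MANDATORY if name != "PubMed ID"}

print(missing_mandatory(example_row))

# For pre-publication data, PubMed ID could be made optional:
prepublication = [name for name in MANDATORY if name != "PubMed ID"]
print(missing_mandatory(example_row, prepublication))
```

Passing the mandatory list as a parameter keeps the PubMed ID rule easy to relax if and when pre-publication data is accepted.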

ValWood commented 4 years ago

That sounds fine.

mah11 commented 4 years ago

I've done the bulk of the work for this now. The changes in the metadata file and related files are fairly big and sufficiently scary that I've done them as pull requests, complete with requests for Kim to review them before either of us merges them in.

https://github.com/pombase/pombase-chado/pull/785 https://github.com/pombase/pombase-config/pull/32

I have gone out on the limb of updating the website documentation to match the changes, and added & linked up the metadata submission spreadsheet template. Wheee!

This can close when the pull requests are merged. After that, new tickets for further changes.

mah11 commented 4 years ago

Kim merged and tested yesterday's changes, and says it all looks fine.

pombase / website

suggestions for the browser data form #1473