Closed ValWood closed 2 years ago
I'd prefer to make a new file in the one line per gene format because otherwise we'll need to change the ortholog reading code. That file (pombe-cerevisiae-orthologs-with-systematic-ids.txt.gz) is the one read into JaponicusDB. That's why the file was created.
I think we should put the new file here: https://curation.pombase.org/dumps/latest_build/ rather than in the exports directory because that's where the equivalent human file lives.
pombe-cerevisiae-orthologs-with-systematic-ids.txt.gz and pombe-human-orthologs-with-systematic-ids.txt.gz are in the same format - one line per ortholog, using systematic IDs for all organisms. I think it makes sense to keep them in the same format. And it's a useful format for parsing.
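For anyone parsing these, here is a minimal sketch of how a one-line-per-gene record could be expanded into the one-line-per-ortholog form. It assumes tab-separated columns, `|` between multiple orthologs, and `NONE` for "no ortholog"; the function name and the gene IDs in the example are made up for illustration and are not taken from the real files.

```python
def expand_orthologs(lines, none_marker="NONE"):
    """Yield (pombe_id, ortholog_id) pairs from one-line-per-gene records.

    Assumed input layout (not confirmed against the real export):
    <pombe systematic ID> TAB <ortholog IDs separated by "|", or NONE>
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        pombe_id, _, ortholog_field = line.partition("\t")
        for ortholog_id in ortholog_field.split("|"):
            ortholog_id = ortholog_id.strip()
            if ortholog_id and ortholog_id != none_marker:
                yield pombe_id, ortholog_id


# Placeholder identifiers, for illustration only.
example = [
    "SPAC0000.01\tYXX001W|YXX002C\n",
    "SPAC0000.02\tNONE\n",
]
pairs = list(expand_orthologs(example))
```

Genes whose only entry is `NONE` produce no output lines in this sketch, which is one reason the per-ortholog layout is simple to parse.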
Sorry about that. I've put it back manually and it should be automatic from now on. I checked in my change after the load had started but didn't realise.
The new file used S. c. gene names. Would it be better to use the systematic (locus) ID? This is what I have always used historically (it should not change, whereas gene names change occasionally, and of course new ones are added).
In fact, in the current situation I see SPAC14C4.12c NONE|FUN19 if no primary name is assigned.
"NONE" usually refers to "no S. cerevisiae ortholog"
The new file used S. c gene names
Currently the file is generated by the same code that makes pombase-latest.human-orthologs.txt.gz, which has gene names.
"NONE" usually refers to "no S. cerevisiae ortholog"
That's probably happening because some of the cerevisiae genes have no name. I'll add a flag to the exporter to make it use a different identifier for cerevisiae.
It's a shame we need a different format for the two species.
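To illustrate the NONE|FUN19 problem and the kind of flag being discussed, here is a rough sketch, not the actual exporter code. The function names, the `use_systematic_id` flag and the identifiers in the example are all hypothetical. The idea is that a missing gene name falls back to the systematic ID, so `NONE` is only ever written when a gene has no ortholog at all.

```python
from typing import List, Optional, Tuple

NO_ORTHOLOG = "NONE"


def export_label(systematic_id: str, name: Optional[str],
                 use_systematic_id: bool) -> str:
    """Pick the identifier to write for one ortholog.

    Falls back to the systematic ID when the gene has no name, so an
    unnamed gene is never written as "NONE".
    """
    if use_systematic_id or not name:
        return systematic_id
    return name


def ortholog_field(orthologs: List[Tuple[str, Optional[str]]],
                   use_systematic_id: bool) -> str:
    """Build the ortholog column for one pombe gene."""
    if not orthologs:
        return NO_ORTHOLOG  # reserved for "no ortholog"
    return "|".join(
        export_label(sid, name, use_systematic_id) for sid, name in orthologs
    )


# Placeholder (systematic ID, name) pairs; the first gene is unnamed.
orthologs = [("YXX001W", None), ("YXX002C", "NAME1")]
by_name = ortholog_field(orthologs, use_systematic_id=False)
by_locus = ortholog_field(orthologs, use_systematic_id=True)
```

With the flag off, the unnamed gene is written by its systematic ID rather than as `NONE`; with the flag on, every ortholog is written by systematic ID.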
would it be better to use the systematic (locus)
Is that the Y- codes or the S- codes?
We used the Y codes in the old versions. I think we should stick with those because the community uses them, but they don't really use the SGD: identifiers.
We have a general problem in the community with these name spaces, and historically different communities have used different labels for certain types of names (locus ID vs gene name, etc.). In fact somebody from the HGNC is bringing up the issue at the biocurator meeting and I intend to flag it at the ELIXIR biocuration working group.
In the meantime I think we should continue to use the locus ID for S. cerevisiae and the HGNC-assigned name for human (because every human protein has an assigned name). Sorry, I realise this is a bit of a pain....
The new file description is in a different ticket. https://github.com/pombase/website/issues/1863#issuecomment-1094438573
(Putting this here for Vivian, because I forgot we had 2 separate tickets)
@VivianMonzon
OK. I've manually updated this file with the locus IDs: https://curation.pombase.org/dumps/latest_build/pombase-latest.cerevisiae-orthologs.txt.gz
From tomorrow it will happen automatically.
OK thanks. From my cursory checks the new file seems to be good. I think I made all of the recent additions and changes in the correct place. Phew!
I put the old files in the old... subdirectory in svn.
Will the updated and new files also appear here: https://www.pombase.org/data/orthologs/
Sorry, I forgot to do that.
I've changed the nightly update to put the files in that directory, with one in each format for human and cerevisiae. The file names are currently:
but I admit those names aren't great.
The files will be there on Thursday morning.
I've updated the README too. (Changes will be visible tomorrow)
Make the table here: https://curation.pombase.org/dumps/latest_build/exports/ the same format as the human table (one line per pombe gene)
Then I need to check the diff between the old and the new file https://www.pombase.org/data/orthologs/ (this is the file I create manually)
Historically I used to make a list of changes in the track wiki, and make a release once there were a bunch of changes. I think since the wiki disappeared I have been making the changes in Artemis (but not recording them in track).
If I can check that the files all line up we can move to the automated file export. The only reason I kept using the old version was because it had information about the tandem repeats. I think this info could just go in a README file....
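One way to check that the files line up is to compare ortholog sets per pombe gene instead of diffing the text, so ordering differences within a line don't show up as changes. This is only a sketch: it assumes both files have been brought into the same one-line-per-gene layout (tab-separated, `|` between orthologs, `NONE` for no ortholog) with the same identifier type, and that any extra annotation in the manual file such as the tandem repeat information has been stripped first. The function names and example IDs are made up.

```python
def load_orthologs(lines):
    """Map each pombe ID to its set of ortholog IDs (empty set for NONE)."""
    table = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        pombe_id, _, field = line.partition("\t")
        ids = {x.strip() for x in field.split("|")} - {"NONE", ""}
        table[pombe_id] = ids
    return table


def diff_orthologs(old, new):
    """Return {pombe_id: (removed, added)} for genes whose orthologs differ."""
    changes = {}
    for gene in sorted(old.keys() | new.keys()):
        before = old.get(gene, set())
        after = new.get(gene, set())
        if before != after:
            changes[gene] = (sorted(before - after), sorted(after - before))
    return changes


# Placeholder records, for illustration only.
old = load_orthologs(["G1\tA|B", "G2\tNONE"])
new = load_orthologs(["G1\tA", "G2\tC", "G3\tD"])
changes = diff_orthologs(old, new)
```

For the real files, the lines could come from `gzip.open(path, "rt")`. A gene listed with `NONE` and a gene missing from the file are treated the same here, which may or may not be what is wanted.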