pombe/cerevisiae ortholog table

ValWood commented 2 years ago

Make the table here: https://curation.pombase.org/dumps/latest_build/exports/ in the same format as the human table (one line per pombe gene).
Then I need to check the diff between the new file and the old one, https://www.pombase.org/data/orthologs/ (this is the file I create manually).
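
The old-versus-new check described above could be sketched roughly as follows. This is only an illustration: it assumes both files are already in the one-line-per-gene layout (pombe systematic ID, whitespace, then `|`-separated ortholog IDs or `NONE`), which may not hold for the manually maintained file. The function names `parse_table` and `diff_tables` are hypothetical, and the IDs in the usage note are made up.

```python
def parse_table(text):
    """Map each pombe gene ID to a set of ortholog IDs (empty set for NONE).

    Assumed layout (not confirmed by the thread): one gene per line,
    gene ID, whitespace, then '|'-separated orthologs or the word NONE.
    """
    table = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and any comment lines
        fields = line.split(None, 1)
        gene = fields[0]
        value = fields[1].strip() if len(fields) > 1 else "NONE"
        table[gene] = set() if value == "NONE" else set(value.split("|"))
    return table


def diff_tables(old, new):
    """Return (genes only in old, genes only in new, genes whose orthologs differ)."""
    only_old = sorted(old.keys() - new.keys())
    only_new = sorted(new.keys() - old.keys())
    changed = {
        gene: (old[gene], new[gene])
        for gene in old.keys() & new.keys()
        if old[gene] != new[gene]
    }
    return only_old, only_new, changed
```

Comparing sets rather than raw strings means a change in the order of orthologs on a line is not reported as a difference. For example, with made-up IDs, `diff_tables(parse_table("SPtest.01\tYZZ001C\n"), parse_table("SPtest.01\tNONE\n"))` reports `SPtest.01` as changed.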

Historically I used to make a list of changes in the track wiki and make a release once there were a bunch of changes. I think since the wiki disappeared I have been making the changes in Artemis (but not recording them in track).

If I can check that the files all line up, we can move to the automated file export. The only reason I kept using the old version was that it had information about the tandem repeats. I think this info could just go in a README file.

kimrutherford commented 2 years ago

I'd prefer to make a new file in the one line per gene format because otherwise we'll need to change the ortholog reading code. That file (pombe-cerevisiae-orthologs-with-systematic-ids.txt.gz) is the one read into JaponicusDB. That's why the file was created.

I think we should put the new file here: https://curation.pombase.org/dumps/latest_build/ rather than in the exports directory because that's where the equivalent human file lives.

pombe-cerevisiae-orthologs-with-systematic-ids.txt.gz and pombe-human-orthologs-with-systematic-ids.txt.gz are in the same format - one line per ortholog, using systematic IDs for all organisms. I think it makes sense to keep them in the same format. And it's a useful format for parsing.
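
To illustrate how the two layouts relate, here is a minimal sketch that folds a one-line-per-ortholog listing into one line per pombe gene. It assumes two tab-separated columns (pombe systematic ID, then the other organism's systematic ID); the actual column layout of the `-with-systematic-ids` files is not spelled out in this thread, and `to_one_line_per_gene` is a hypothetical name.

```python
from collections import defaultdict


def to_one_line_per_gene(pair_lines):
    """Collapse 'pombe_id<TAB>ortholog_id' lines into one line per pombe gene.

    Orthologs keep their first-seen order and are joined with '|';
    duplicate pairs are dropped. Genes are emitted in sorted order.
    """
    grouped = defaultdict(list)
    for line in pair_lines:
        line = line.rstrip("\n")
        if not line:
            continue  # ignore blank lines
        pombe_id, ortholog_id = line.split("\t")[:2]
        if ortholog_id not in grouped[pombe_id]:
            grouped[pombe_id].append(ortholog_id)
    return [f"{gene}\t{'|'.join(ids)}" for gene, ids in sorted(grouped.items())]
```

A pombe gene with no ortholog would not appear in the output at all under this assumption, so a real exporter would need a separate gene list to emit the `NONE` rows.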

kimrutherford commented 2 years ago

I've added https://curation.pombase.org/dumps/latest_build/pombase-latest.cerevisiae-orthologs.txt.gz

ValWood commented 2 years ago

Will https://curation.pombase.org/dumps/latest_build/pombase-latest.cerevisiae-orthologs.txt.gz appear tomorrow?

kimrutherford commented 2 years ago

Sorry about that. I've put it back manually and it should be automatic from now on. I checked in my change after the load had started but didn't realise.

ValWood commented 2 years ago

The new file used S. c gene names. Would it be better to use the systematic (locus) ID? This is what I have always used historically (it should not change, whereas gene names change occasionally, and of course new ones are added).

ValWood commented 2 years ago

In fact, in the current situation I see SPAC14C4.12c NONE|FUN19 if no primary name is assigned.

ValWood commented 2 years ago

"NONE" usually refers to "no S. cerevisiae ortholog"

kimrutherford commented 2 years ago

The new file used S. c gene names

Currently the file is generated by the same code that makes pombase-latest.human-orthologs.txt.gz, which has gene names.

"NONE" usually refers to "no S. cerevisiae ortholog"

That's probably happening because some of the cerevisiae genes have no name. I'll add a flag to the exporter to make it use a different identifier for cerevisiae.

It's a shame we need a different format for the two species.

would it be better to use the systematic (locus)

Is that the Y- codes or the S- codes?

ValWood commented 2 years ago

We used the Y codes in the old versions. I think we should stick with those because the community uses them; they don't really use the SGD: identifiers.

We have a general problem in the community with these namespaces: historically, different communities have used different labels for certain types of names (locus ID vs gene name, etc.). In fact somebody from the HGNC is bringing up the issue at the biocurator meeting, and I intend to flag it at the ELIXIR biocuration work group.

In the meantime I think we should continue to use the locus ID for S. cerevisiae and the HGNC-assigned name for human (because every human protein has an assigned name). Sorry, I realise this is a bit of a pain....
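
As a rough illustration of the rule proposed here (locus ID for S. cerevisiae, assigned name for human), the sketch below picks the identifier field per organism, so `NONE` can only ever mean "no ortholog". It is not the exporter's actual code: the dictionary, the `ortholog_column` function and the gene records are all hypothetical, and the IDs and names are made up.

```python
# Hypothetical per-organism setting: which field the export writes.
ID_FIELD_BY_ORGANISM = {
    "cerevisiae": "systematic_id",  # Y codes, stable even for unnamed genes
    "human": "name",                # assumes every human gene has an assigned name
}


def ortholog_column(orthologs, organism):
    """Build the ortholog column for one pombe gene.

    'orthologs' is a list of dicts with 'systematic_id' and 'name' keys.
    Returns 'NONE' only when the list is empty.
    """
    field = ID_FIELD_BY_ORGANISM[organism]
    ids = [gene[field] for gene in orthologs]
    return "|".join(ids) if ids else "NONE"


# Made-up records: the first cerevisiae gene has no primary name.
example = [
    {"systematic_id": "YZZ001W", "name": None},
    {"systematic_id": "YZZ002C", "name": "XYZ1"},
]
print(ortholog_column(example, "cerevisiae"))
```

With names as the identifier, the first record would have had nothing to print, which is the situation that produced the mixed `NONE|FUN19` entry discussed earlier in the thread.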

ValWood commented 2 years ago

The new file description is in a different ticket. https://github.com/pombase/website/issues/1863#issuecomment-1094438573

(Putting this here for Vivian, because I forgot we had 2 separate tickets)

ValWood commented 2 years ago

@VivianMonzon

kimrutherford commented 2 years ago

OK. I've manually updated this file with the locus IDs: https://curation.pombase.org/dumps/latest_build/pombase-latest.cerevisiae-orthologs.txt.gz

From tomorrow it will happen automatically.
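
A quick sanity check on a downloaded copy of the file could look something like the sketch below. It flags rows where `NONE` appears alongside real identifiers, as in the `NONE|FUN19` case. It assumes the one-line-per-gene layout with whitespace-separated columns and a local `.gz` file; the function name `find_mixed_none` is hypothetical.

```python
import gzip


def find_mixed_none(path):
    """Return pombe IDs whose ortholog column mixes NONE with other identifiers."""
    suspicious = []
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            fields = line.split()
            if len(fields) < 2:
                continue  # skip blank or single-column lines
            ids = fields[1].split("|")
            if "NONE" in ids and len(ids) > 1:
                suspicious.append(fields[0])
    return suspicious
```

An empty list from `find_mixed_none("pombase-latest.cerevisiae-orthologs.txt.gz")` would suggest that every `NONE` in the file stands on its own.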

ValWood commented 2 years ago

OK, thanks. From my cursory checks the new file seems to be good. I think I made all of the recent additions and changes in the correct place. Phew!

I put the old files in the old... subdirectory in svn.

Will the updated and new files also appear here: https://www.pombase.org/data/orthologs/ ?

kimrutherford commented 2 years ago

Will the updated and new files also appear here: https://www.pombase.org/data/orthologs/ ?

Sorry, I forgot to do that.

I've changed the nightly update to put the files in that directory, with one in each format for human and cerevisiae. The file names are currently:

pombe-cerevisiae-orthologs-one-line-per-gene.tsv
pombe-cerevisiae-orthologs.tsv
pombe-human-orthologs-one-line-per-gene.tsv
pombe-human-orthologs.tsv

but I admit those names aren't great.

The files will be there on Thursday morning.

kimrutherford commented 2 years ago

I've updated the README too. (Changes will be visible tomorrow)

pombase / pombase-chado

pombe/cerevisiae ortholog table #960