pombase / website

PomBase website v2
MIT License

downloadable files required for ftp site #355

Closed ValWood closed 7 years ago

ValWood commented 7 years ago

CDS_coordinates, by chromosome - ftp://ftp.ebi.ac.uk/pub/databases/pombase/pombe/CDS_Coordinates/
exon coordinates - ftp://ftp.pombase.org/pombe/Exon_Coordinates/
see https://github.com/pombase/website/issues/565

genome
fasta by chromosome
fasta entire genome
see https://github.com/pombase/website/issues/566

nucleotide cDNA fasta
nucleotide CDS fasta (for some reason these are currently in the genome directory)
fasta for non cds features https://rt.sanger.ac.uk/SelfService/Display.html?id=533148
see https://github.com/pombase/website/issues/566

mapping files
transferred to https://github.com/pombase/website/issues/564
ftp://ftp.ebi.ac.uk/pub/databases/pombase/pombe/Mappings/
sysID2product.tsv
sysID2product.rna.tsv (current version contains protein-coding pseudos!)
allNames.tsv

ValWood commented 7 years ago

can be done later

protein data

UTR

https://github.com/pombase/website/issues/568

ValWood commented 7 years ago

ftp://ftp.ebi.ac.uk/pub/databases/pombase/FASTA/
I don't know why this stuff is separate. We need to decide what we need from here; some of it seems redundant.

ValWood commented 7 years ago

I think this is actually quite a complete list. I thought I opened a ticket recently (yesterday) to compile this, but I can't find such a ticket.

mah11 commented 7 years ago

We have a FAQ that promises a future downloadable

(product type = protein and NOT (Characterisation status = dubious OR Characterisation status = transposon))
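As a rough illustration of what that query selects, here is a minimal Python sketch. The record fields, gene identifiers and status values are made up for the example; they are not the PomBase schema or real data.

```python
from dataclasses import dataclass


@dataclass
class Gene:
    # All fields and example values are hypothetical placeholders.
    systematic_id: str
    product_type: str             # e.g. "protein", "ncRNA"
    characterisation_status: str  # e.g. "published", "dubious", "transposon"


# Statuses excluded by the NOT (...) part of the query.
EXCLUDED_STATUSES = {"dubious", "transposon"}


def matches_query(gene: Gene) -> bool:
    """product type = protein AND NOT (status = dubious OR status = transposon)."""
    return (
        gene.product_type == "protein"
        and gene.characterisation_status not in EXCLUDED_STATUSES
    )


genes = [
    Gene("gene_a", "protein", "published"),
    Gene("gene_b", "protein", "dubious"),
    Gene("gene_c", "ncRNA", "published"),
    Gene("gene_d", "protein", "transposon"),
]

selected = [g.systematic_id for g in genes if matches_query(g)]
print(selected)
```

Only the first record passes both conditions, so the same predicate could drive whatever export ends up producing the downloadable file.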

kimrutherford commented 7 years ago

fasta by chromosome

I can't find that on PomBase V1. Could you send me the link?

kimrutherford commented 7 years ago

It's probably a good time to reorganise. It's hard to find things. I think we should get rid of the pombe and DATASET sub-directories and move the contents. There are some things to remove as well like "EBeyeXML".

ftp://ftp.pombase.org/

ValWood commented 7 years ago

fasta by chromosome

ftp://ftp.ebi.ac.uk/pub/databases/pombase/pombe/Chromosome_Dumps/fasta/

from http://www.pombase.org/downloads/genome-datasets but currently empty?

It's probably a good time to reorganise. It's hard to find things.

YES PLEASE!

ValWood commented 7 years ago

the public stuff has always been under ftp://ftp.ebi.ac.uk/pub/databases/pombase/pombe/

I didn't discover ftp://ftp.ebi.ac.uk/pub/databases/pombase/ until recently, and we have always made sure that the stuff linked from downloads is under pombase/pombe/

So we only need to pick out the stuff we need from the other directories under pombase/

The stuff hosted in the genome browser is here....that's "DATASETS"

The only other useful thing is ftp://ftp.ebi.ac.uk/pub/databases/pombase/binding_motifs.bed which should be in DATASETS

I did once look in the DATASETS directory and notice that it was a bit random.... We can try to rationalise this after public release when you work on JBrowse... It would be great if we could reach a situation where we can get people to upload their browser track data in the desired format, with the correct labels, and we don't need to do much but police that it is done correctly....

kimrutherford commented 7 years ago

I suggest we move everything on the new FTP site into an "OLD" directory then move things out as we decide where to put them. We can do that live on a Skype call if you like. I can move and rename things as we decide.
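The "move everything into OLD first" step could be scripted along these lines. This is only a sketch run against a throwaway temporary directory; the function name and the example directory contents are illustrative, and nothing here touches the real FTP site.

```python
import shutil
import tempfile
from pathlib import Path


def move_all_into_old(root: Path, old_name: str = "OLD") -> list:
    """Move every top-level entry of root into root/old_name.

    Returns the names that were moved, in sorted order.
    """
    old_dir = root / old_name
    old_dir.mkdir(exist_ok=True)
    moved = []
    # sorted() takes a snapshot, so moving entries while looping is safe.
    for entry in sorted(root.iterdir()):
        if entry.name == old_name:
            continue  # never move OLD into itself
        shutil.move(str(entry), str(old_dir / entry.name))
        moved.append(entry.name)
    return moved


# Demonstration on a temporary directory with a few stand-in entries.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "pombe").mkdir()
    (root / "DATASETS").mkdir()
    (root / "binding_motifs.bed").write_text("")
    moved = move_all_into_old(root)
    remaining = sorted(p.name for p in root.iterdir())
    print(moved, remaining)
```

After this, things can be moved back out of OLD one at a time as their new location is agreed.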

There are a few directories that I think we should rename. "DATASETS" is one of them. I think it would be clearer to call it something like "external_datasets" or "community_datasets" to make it clear that they aren't things that we have generated.

ValWood commented 7 years ago

Yes, it would be pretty easy by Skype. Let's do it on this week's call.

kimrutherford commented 7 years ago

nucleotide cDNA fasta

The current site has three files:

cdna_introns_utrs.fa.gz
cdna_nointrons_noutrs.fa.gz
cdna_nointrons_utrs.fa.gz

Are all three needed?

ftp://ftp.pombase.org/FASTA/

ValWood commented 7 years ago

Hmm, not sure. We would have created these because they were asked for, and I could see all 3 being used....

Although the UTR ones are a bit arbitrary because you would never know if a gene had them (and they are often incorrect).

I would have thought that cdna_introns_noutrs.fa.gz was a more common choice but we don't provide that.

Thoughts anyone?

kimrutherford commented 7 years ago

They're easy enough to generate if they're needed.

Perhaps we can improve the file names though. This one for example gives me a headache:

cdna_nointrons_noutrs.fa.gz

We shouldn't need to say "nointrons" because a cDNA file shouldn't have introns in it. And if it hasn't got UTRs it's not cDNA.

I would call that cds.fa or coding_sequences.fa
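To make the difference between the three files concrete, here is a small Python sketch on a toy gene model. It assumes the file names mean what they appear to say ("introns"/"nointrons" = introns kept/spliced out, "utrs"/"noutrs" = UTRs kept/trimmed); the sequence and coordinates are invented for illustration.

```python
# Toy gene on the forward strand; coordinates are 0-based, half-open.
# Layout: 5' UTR "GG" | CDS "ATGCCC" | intron "GTAAGTAG" | CDS "TAA" | 3' UTR "CC"
genomic = "GGATGCCCGTAAGTAGTAACC"
exons = [(0, 8), (16, 21)]    # exon 1 includes the 5' UTR, exon 2 the 3' UTR
cds_start, cds_end = 2, 19    # first base of ATG, one past the stop codon


def spliced(seq, parts):
    """Concatenate the given (start, end) slices of seq."""
    return "".join(seq[s:e] for s, e in parts)


def clip_to_cds(parts, start, end):
    """Trim exon coordinates so that only the coding portion remains."""
    clipped = []
    for s, e in parts:
        s2, e2 = max(s, start), min(e, end)
        if s2 < e2:
            clipped.append((s2, e2))
    return clipped


variants = {
    # Whole transcribed region: introns and UTRs kept.
    "cdna_introns_utrs": genomic[exons[0][0]:exons[-1][1]],
    # Introns spliced out, UTRs kept: what is usually meant by cDNA.
    "cdna_nointrons_utrs": spliced(genomic, exons),
    # Introns spliced out, UTRs trimmed: the coding sequence (CDS).
    "cdna_nointrons_noutrs": spliced(genomic, clip_to_cds(exons, cds_start, cds_end)),
}

for name, seq in variants.items():
    print(name, seq)
```

On this reading, the third variant is exactly the coding sequence, which is why a name like cds.fa describes it more directly.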

ValWood commented 7 years ago

yes you are correct.

Let's go through all the ftp files on the call. I reckon we can do that in an hour....

mah11 commented 7 years ago

More things I've just noticed now: the documents linked on the legacy-pombase Documents page (http://www.pombase.org/downloads/documents) are a mix of things in the FTP "Archived directories" department and files uploaded in that sodding Drupal system. Can they all have homes somewhere in shiny-new-pombase FTP?

Let me know if you want me to make a list, or if there's any other way I can help.

kimrutherford commented 7 years ago

I've been chipping away at generating these files from Chado each night. There are just three files so far, but I'll add more: https://curation.pombase.org/dumps/latest_build/fasta/

ValWood commented 7 years ago

This will be so good. Imagine all of the constant checking we won't need to do. The files will just BE THERE!

Can we line wrap at the normal number (60?)

v

kimrutherford commented 7 years ago

Can we line wrap at the normal number (60?)

Yep, will do.
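For reference, wrapping at 60 columns amounts to something like the following Python sketch. It is not the code used by the nightly dump; the function name and the example record are made up.

```python
def wrap_fasta(header: str, sequence: str, width: int = 60) -> str:
    """Format one FASTA record with sequence lines of at most `width` characters."""
    lines = [">" + header]
    # Slice the sequence into consecutive chunks of `width` characters.
    lines.extend(sequence[i:i + width] for i in range(0, len(sequence), width))
    return "\n".join(lines) + "\n"


# A 160-base placeholder sequence gives lines of 60, 60 and 40 bases.
record = wrap_fasta("example_id", "ACGT" * 40)
print(record, end="")
```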

kimrutherford commented 7 years ago

sysID2product.rna.tsv (current version contains protein-coding pseudos!)

Just to check, should that contain every type of RNA?

ValWood commented 7 years ago

My first thought was that it shouldn't contain protein-coding pseudogenes.... except that if these do still make mRNA, it does sometimes have a regulatory role.... so perhaps the non protein coding pseudo RNAs should be in here?

kimrutherford commented 7 years ago

non protein coding pseudo RNAs

Do we have any of those? Could you point me at an example?

ValWood commented 7 years ago

Sorry I didn't explain properly. ncRNAs by definition are RNAs which do not encode proteins. If we have an mRNA which no longer encodes a protein, I guess it could be classed as a ncRNA...

We have 28 genes flagged as pseudo. If most RNAs related to proteins have some regulatory role (and as most of these pseudos have very close paralogs), it is possible, even likely, that the RNAs are regulatory in some capacity.

I'm thinking that the ncRNA file might be the best place for these to be....

kimrutherford commented 7 years ago

Can we line wrap at the normal number (60?)

That's done and will be in tomorrow night's load.

kimrutherford commented 7 years ago

Some of the files in your list are now generated nightly. They appear in this FTP directory: ftp://ftp.pombase.org/nightly_update/

These files are now generated nightly:

These two files:

appear here: ftp://ftp.pombase.org/nightly_update/misc/

At the moment none of the nightly generated files are copied to ftp://ftp.pombase.org/pombe/ where the rest of the files are.

ValWood commented 7 years ago

All in new tickets