Closed ValWood closed 7 years ago
can be done later
protein data
UTR
ftp://ftp.ebi.ac.uk/pub/databases/pombase/FASTA/ I don't know why this stuff is separate. We need to decide what we need from here; some of it seems redundant.
I think this is actually quite a complete list. I thought I opened a ticket recently (yesterday) to compile this, but I can't find such a ticket.
We have a FAQ that promises a future downloadable
(product type = protein and NOT (Characterisation status = dubious OR Characterisation status = transposon))
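In case it helps when we write the export, here is a minimal sketch of that filter. The `Gene` class, the example IDs and the "published" status value are all made up for illustration; only the "protein", "dubious" and "transposon" values come from the query above.

```python
from dataclasses import dataclass


@dataclass
class Gene:
    # Hypothetical record type; field names are assumptions, not the Chado schema.
    systematic_id: str
    product_type: str
    characterisation_status: str


# Statuses excluded by the query above.
EXCLUDED_STATUSES = {"dubious", "transposon"}


def in_protein_set(gene: Gene) -> bool:
    """True if the gene is protein coding and not dubious or a transposon."""
    return (
        gene.product_type == "protein"
        and gene.characterisation_status not in EXCLUDED_STATUSES
    )


# Made-up example records.
genes = [
    Gene("gene_a", "protein", "published"),
    Gene("gene_b", "protein", "dubious"),
    Gene("gene_c", "ncRNA", "published"),
    Gene("gene_d", "protein", "transposon"),
]

selected = [g.systematic_id for g in genes if in_protein_set(g)]
```

With these example records only `gene_a` passes the filter.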
fasta by chromosome
I can't find that on PomBase V1. Could you send me the link?
It's probably a good time to reorganise. It's hard to find things. I think we should get rid of the pombe and DATASET sub-directories and move the contents. There are some things to remove as well like "EBeyeXML".
ftp://ftp.pombase.org/
fasta by chromosome
ftp://ftp.ebi.ac.uk/pub/databases/pombase/pombe/Chromosome_Dumps/fasta/
from http://www.pombase.org/downloads/genome-datasets but currently empty?
It's probably a good time to reorganise. It's hard to find things.
YES PLEASE!
the public stuff has always been under ftp://ftp.ebi.ac.uk/pub/databases/pombase/pombe/
I didn't discover ftp://ftp.ebi.ac.uk/pub/databases/pombase/ until recently, and we have always made sure that the stuff linked from the downloads page is under pombase/pombe/
So we only need to pick the stuff we need from pombase/ all other directories
The stuff hosted in the genome browser is here....that's "DATASETS"
The only other useful thing is ftp://ftp.ebi.ac.uk/pub/databases/pombase/binding_motifs.bed which should be in DATASETS
I did once look in the DATASETS directory and noticed that it was a bit random.... We can try to rationalise this after public release when you work on JBrowse... It would be great if we could reach a situation where we can get people to upload their browser track data in the desired format, with the correct labels, and we don't need to do much but police that it is done correctly....
I suggest we move everything on the new FTP site into an "OLD" directory then move things out as we decide where to put them. We can do that live on a Skype call if you like. I can move and rename things as we decide.
There are a few directories that I think we should rename. "DATASETS" is one of them. I think it would be clearer to call it something like "external_datasets" or "community_datasets" to make it clear that they aren't things that we have generated.
Yes, it would be pretty easy by Skype. Let's do it on this week's call.
nucleotide cDNA fasta
The current site has three files:
cdna_introns_utrs.fa.gz
cdna_nointrons_noutrs.fa.gz
cdna_nointrons_utrs.fa.gz
Are all three needed?
ftp://ftp.pombase.org/FASTA/
Hmm, not sure. We would have created these because they were asked for, and I could see all 3 being used....
Although the UTR ones are a bit arbitrary because you would never know if a gene had them (and they are often incorrect).
I would have thought that cdna_introns_noutrs.fa.gz was a more common choice but we don't provide that.
Thoughts anyone?
They're easy enough to generate if they're needed.
Perhaps we can improve the file names though. This one for example gives me a headache:
cdna_nointrons_noutrs.fa.gz
We shouldn't need to say "nointrons" because a cDNA file shouldn't have introns in it. And if it hasn't got UTRs it's not cDNA.
I would call that cds.fa
or coding_sequences.fa
yes you are correct.
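To make the naming distinction concrete, here is a rough sketch of how the two sequences differ for a single gene. It assumes a forward-strand gene, 0-based half-open coordinates and a toy sequence; the function names and the example are hypothetical and this is not how the existing files were produced.

```python
from typing import List, Tuple

Exon = Tuple[int, int]  # (start, end), 0-based, end exclusive


def spliced_transcript(genomic: str, exons: List[Exon]) -> str:
    """Join the exons in order: introns removed, UTRs kept (cDNA-like)."""
    return "".join(genomic[start:end] for start, end in sorted(exons))


def coding_sequence(genomic: str, exons: List[Exon],
                    cds_start: int, cds_end: int) -> str:
    """Join only the coding part of each exon: no introns, no UTRs (CDS)."""
    parts = []
    for start, end in sorted(exons):
        lo, hi = max(start, cds_start), min(end, cds_end)
        if lo < hi:  # skip exons that lie entirely inside a UTR
            parts.append(genomic[lo:hi])
    return "".join(parts)


# Toy gene: 5' UTR "gg", exon 1 coding, intron "gtaag", exon 2 coding, 3' UTR "tt".
genomic = "gg" + "ATGAAA" + "gtaag" + "CCCTAA" + "tt"
exons = [(0, 8), (13, 21)]

cdna = spliced_transcript(genomic, exons)      # what "cdna_nointrons_utrs" holds
cds = coding_sequence(genomic, exons, 2, 19)   # what "cds.fa" would hold
```

Here `cdna` still carries the lowercase UTR bases at both ends, while `cds` starts at ATG and ends at the stop codon.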
Let's go through all the ftp files on the call. I reckon we can do that in an hour....
More things I've just noticed now: the documents linked on the legacy-pombase Documents page (http://www.pombase.org/downloads/documents) are a mix of things in the FTP "Archived directories" department and files uploaded in that sodding Drupal system. Can they all have homes somewhere in shiny-new-pombase FTP?
Let me know if you want me to make a list, or if there's any other way I can help.
I've been chipping away at generating these files from Chado each night. There are just three files so far, but I'll add more: https://curation.pombase.org/dumps/latest_build/fasta/
This will be so good. imagine all of the constant checking we won't need to do. The files will just BE THERE!
Can we line wrap at the normal number (60?)
Can we line wrap at the normal number (60?)
Yep, will do.
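For reference, wrapping at 60 columns could look something like the sketch below. The function names and the example record are made up; the actual dump code may do this differently.

```python
from typing import List


def wrap_sequence(seq: str, width: int = 60) -> List[str]:
    """Split a sequence into chunks of at most `width` characters."""
    return [seq[i:i + width] for i in range(0, len(seq), width)]


def format_fasta(header: str, seq: str, width: int = 60) -> str:
    """Return one FASTA record with the sequence wrapped at `width` columns."""
    lines = [">" + header] + wrap_sequence(seq, width)
    return "\n".join(lines) + "\n"


# Hypothetical 160-base record: wraps to lines of 60, 60 and 40 bases.
record = format_fasta("example_gene", "ACGT" * 40)
```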
sysID2product.rna.tsv (current version contains protein coding pseudos!)
Just to check, should that contain every type of RNA?
At first I didn't think it should contain protein coding pseudos.... except that if these do still make mRNA, it does sometimes have a regulatory role.... so perhaps the non protein coding pseudo RNAs should be in here?
non protein coding pseudo RNAs
Do we have any of those? Could you point me at an example?
Sorry I didn't explain properly. ncRNAs by definition are RNAs which do not encode proteins. If we have an mRNA which no longer encodes a protein, I guess it could be classed as a ncRNA...
We have 28 genes flagged as pseudo. If most RNAs related to proteins have some regulatory role (and as most of these pseudos have very close paralogs), it is possible, even likely, that these RNAs are regulatory in some capacity.
I'm thinking that the ncRNA file might be the best place for these to be....
Can we line wrap at the normal number (60?)
That's done and will be in tomorrow night's load.
Some of the files in your list are now generated nightly. They appear in this FTP directory: ftp://ftp.pombase.org/nightly_update/
These files are now generated nightly:
These two files:
appear here: ftp://ftp.pombase.org/nightly_update/misc/
At the moment none of the nightly generated files are copied to ftp://ftp.pombase.org/pombe/ where the rest of the files are.
All in new tickets
CDS_coordinates, by chromosome - ftp://ftp.ebi.ac.uk/pub/databases/pombase/pombe/CDS_Coordinates/
exon coordinates - ftp://ftp.pombase.org/pombe/Exon_Coordinates/
see https://github.com/pombase/website/issues/565
genome fasta by chromosome
fasta entire genome
see https://github.com/pombase/website/issues/566
[ ] gff (We need to list the features required in the gff file) needs to include see https://github.com/pombase/website/issues/567
[ ] manually curated LTRs from https://github.com/pombase/website/issues/61
[x] chromosome contigs (already available)
nucleotide cDNA fasta
nucleotide CDS fasta (for some reason these are currently in the genome directory)
fasta for non cds features https://rt.sanger.ac.uk/SelfService/Display.html?id=533148
see https://github.com/pombase/website/issues/566
[ ] complex file described here https://github.com/pombase/website/issues/273
[x] GAF (already available)
[x] Phaf (already available)
[x] modifications (already available)
[x] orthologs (already available)
[x] HCPIN datasets (already available) ftp://ftp.ebi.ac.uk/pub/databases/pombase/pombe/exports/
mapping files transferred to https://github.com/pombase/website/issues/564
ftp://ftp.ebi.ac.uk/pub/databases/pombase/pombe/Mappings/
sysID2product.tsv
sysID2product.rna.tsv (current version contains protein coding pseudos!)
allNames.tsv