sebhtml / Paper-Replication-2012

Cannot retrieve cds.fasta.gz sequences (Build-Input-Files-for-Gene-ontology/Main.sh) #2

Open ashishdamania opened 9 years ago

ashishdamania commented 9 years ago

EMBL seems to have changed their file structure (http://www.ebi.ac.uk/about/news/service-news/change-cds-ftp-products), so the script Main.sh no longer works as intended. I'm not entirely sure about this one: Rebuild-Fasta.py raises an out-of-range error, probably because the script expects ":" in the sequence headers, but they now contain "|".

sebhtml commented 9 years ago

Is it possible to retrieve the old files from EBI?

ashishdamania commented 9 years ago

I looked at their FTP site, and the best I could find was this: ftp://ftp.ebi.ac.uk/pub/databases/embl/cds/release/std/fasta/ which I assume is what we need. Correction: maybe this is the correct file: ftp://ftp.ebi.ac.uk/pub/databases/fastafiles/emblcds/emblcds.gz

ashishdamania commented 9 years ago

I retrieved the sequences from ftp://ftp.ebi.ac.uk/pub/databases/fastafiles/emblcds/emblcds.gz and their headers are formatted as shown below:

>EMBLCDS:BAJ49870 BAJ49870.1 Candidatus Caldiarchaeum subterraneum archaeal cell division control protein 6 

So the issue is just the FTP address; the rest of your script still works fine.
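For anyone who hits the out-of-range error again, here is a minimal sketch of a header parser that accepts either ":" or "|" as the separator. It is not taken from Rebuild-Fasta.py; the function name `parse_header` and the choice to return only the first two fields are my own assumptions.

```python
import re


def parse_header(line):
    """Return (database, accession) from a FASTA header line.

    Hypothetical helper: accepts either ':' or '|' between the fields
    of the first whitespace-delimited token.
    """
    if not line.startswith(">"):
        raise ValueError("not a FASTA header: %r" % line)
    # Keep only the identifier token, e.g. "EMBLCDS:BAJ49870".
    parts = line[1:].split(None, 1)
    if not parts:
        raise ValueError("empty FASTA header")
    fields = re.split(r"[:|]", parts[0])
    if len(fields) < 2:
        # Fail with a clear message instead of an IndexError.
        raise ValueError("no ':' or '|' separator in %r" % parts[0])
    return fields[0], fields[1]
```

Called on the header quoted above, this returns the pair ("EMBLCDS", "BAJ49870"). A header with no separator raises a ValueError that names the offending token, which is easier to diagnose than an index error.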

So line 43 in Build-Input-Files-for-Gene-Ontology/Main.sh should be changed from ftp://ftp.ebi.ac.uk/pub/databases/embl/cds/cds.fasta.gz to ftp://ftp.ebi.ac.uk/pub/databases/fastafiles/emblcds/emblcds.gz
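Since the location has already moved once, a download step that tries several candidate addresses might be more robust than a single hard-coded one. The following is only a sketch: `CDS_URLS` and `fetch_first_available` are hypothetical names, the two URLs are the ones mentioned in this thread, and neither is guaranteed to still exist.

```python
from urllib.request import urlretrieve

# Candidate locations from this thread, newest first (assumption:
# either may have moved again since).
CDS_URLS = [
    "ftp://ftp.ebi.ac.uk/pub/databases/fastafiles/emblcds/emblcds.gz",
    "ftp://ftp.ebi.ac.uk/pub/databases/embl/cds/cds.fasta.gz",
]


def fetch_first_available(urls, destination, fetch=urlretrieve):
    """Download the first reachable URL to destination and return it.

    `fetch` is injectable so the logic can be exercised offline.
    """
    errors = []
    for url in urls:
        try:
            fetch(url, destination)
            return url
        except OSError as error:  # urllib's URLError is an OSError
            errors.append("%s: %s" % (url, error))
    raise RuntimeError("no URL could be retrieved:\n" + "\n".join(errors))
```

A call such as `fetch_first_available(CDS_URLS, "cds.fasta.gz")` would keep the local file name the rest of the pipeline presumably expects, whichever address ends up working.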

valdeanda commented 8 years ago

Also, line 20 of the Main.sh script should be changed from wget ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gene_association.goa_uniprot.gz to wget ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/goa_uniprot_all.gaf.gz

and the following gunzip gene_association.goa_uniprot.gz should become gunzip goa_uniprot_all.gaf.gz
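If the decompression step is ever moved out of Main.sh, the gunzip call could be reproduced with the Python standard library. This is a sketch under that assumption, not part of the repository; the function name `gunzip` is my own. It streams the data, so the whole annotation file never has to fit in memory.

```python
import gzip
import os
import shutil


def gunzip(path):
    """Decompress path ('*.gz') next to itself and remove the archive.

    Mirrors the default behaviour of the gunzip command; returns the
    name of the decompressed file.
    """
    if not path.endswith(".gz"):
        raise ValueError("expected a .gz file, got %r" % path)
    target = path[:-3]
    with gzip.open(path, "rb") as source, open(target, "wb") as output:
        # Copy in chunks rather than reading everything at once.
        shutil.copyfileobj(source, output)
    os.remove(path)
    return target
```

With the renamed file, `gunzip("goa_uniprot_all.gaf.gz")` would leave goa_uniprot_all.gaf in the working directory, so any later step that still refers to gene_association.goa_uniprot would need the new name as well.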