lindenb / jvarkit

Java utilities for Bioinformatics
https://jvarkit.readthedocs.io/
Other
485 stars 133 forks source link

Backlocate w/ E. coli genome #225

Open ibrahimkurt opened 1 year ago

ibrahimkurt commented 1 year ago

Hi,

Is it possible to use backlocate w/ E. coli files? I have the following file from NCBI as the GTF file. And I have the appropriate reference file with the indices. I am trying to figure out how to call the genes/transcripts.

genomic.gtf.zip

There are RefSeq gene, CDS, start_codon and stop_codon entries per gene. When I try to call using the transcript_id or gene attributes I get the error:

echo -e "thrL\tP1090M" | java -jar backlocate.jar -R ../../Documents/CRISPR\ CBE\ Bacteria\ STOP\ Library/GCA_000005845.2.fasta --gtf ../../Documents/CRISPR\ CBE\ Bacteria\ STOP\ Library/from\ NCBI/GCF_000005845.2/ncbi_dataset/data/GCF_000005845.2/genomic.gtf

[WARN][BackLocate]no transcript found for thrL

Thanks a lot for your help.

lindenb commented 1 year ago

The tool expects gene_name as the attribute for the gene, not gene. see https://github.com/lindenb/jvarkit/blob/master/src/test/resources/gencode.v19.annotation.gtf .

Use sed to "fix" your gtf

ibrahimkurt commented 1 year ago

Thanks a lot for your prompt reply. I replaced all gene with gene_name but I am still getting the same error. Could it be because there are only gene, CDS, start_codon and stop_codon entries but not transcript per gene? I manually copied and pasted a gene entry block and replaced the feature with transcript as shown below, but I still got the same error saying [WARN][BackLocate]no transcript found for thrL

Do you have any ideas? Thanks!

NC_000913.3 RefSeq  gene    190 255 .   +   .   gene_id "b0001"; transcript_id ""; db_xref "ASAP:ABE-0000006"; db_xref "ECOCYC:EG11277"; db_xref "GeneID:944742"; gbkey "Gene"; gene_name "thrL"; gene_biotype "protein_coding"; gene_synonym "ECK0001"; locus_tag "b0001"; 
NC_000913.3 RefSeq  transcript  190 255 .   +   .   gene_id "b0001"; transcript_id ""; db_xref "ASAP:ABE-0000006"; db_xref "ECOCYC:EG11277"; db_xref "GeneID:944742"; gbkey "Gene"; gene_name "thrL"; gene_biotype "protein_coding"; gene_synonym "ECK0001"; locus_tag "b0001"; 
NC_000913.3 RefSeq  CDS 190 252 .   +   0   gene_id "b0001"; transcript_id "gnl|b0001|mrna.NP_414542"; db_xref "UniProtKB/Swiss-Prot:P0AD86"; gbkey "CDS"; gene_name "thrL"; locus_tag "b0001"; orig_transcript_id "gnl|b0001|mrna.NP_414542"; product "thr operon leader peptide"; protein_id "NP_414542.1"; transl_table "11"; exon_number "1"; 
NC_000913.3 RefSeq  start_codon 190 192 .   +   0   gene_id "b0001"; transcript_id "gnl|b0001|mrna.NP_414542"; db_xref "UniProtKB/Swiss-Prot:P0AD86"; gbkey "CDS"; gene_name "thrL"; locus_tag "b0001"; orig_transcript_id "gnl|b0001|mrna.NP_414542"; product "thr operon leader peptide"; protein_id "NP_414542.1"; transl_table "11"; exon_number "1"; 
NC_000913.3 RefSeq  stop_codon  253 255 .   +   0   gene_id "b0001"; transcript_id "gnl|b0001|mrna.NP_414542"; db_xref "UniProtKB/Swiss-Prot:P0AD86"; gbkey "CDS"; gene_name "thrL"; locus_tag "b0001"; orig_transcript_id "gnl|b0001|mrna.NP_414542"; product "thr operon leader peptide"; protein_id "NP_414542.1"; transl_table "11"; exon_number "1"; 
lindenb commented 1 year ago

set transcript_id to "gnl|b0001|mrna.NP_414542" for the transcript.