tanghaibao / jcvi

Python library to facilitate genome assembly, annotation, and comparative genomics
BSD 2-Clause "Simplified" License
742 stars 187 forks source link

failed to extract correct .bed file from gff file which is generated from prokka : python -m jcvi.formats.gff bed #634

Closed yyyuechen closed 3 months ago

yyyuechen commented 7 months ago

Dear developer, Thank you so much for developing this amazing tool, while I have succeed in running the example (grape and peach), I have issues in running my own samples, here are my codes:

first I run this : python -m jcvi.formats.gff bed --type=CDS --key=ID 10_Arc_ANME-1_bin.93_prokka.gff -o anme_10_93.bed [03/05/24 00:36:23] DEBUG Load file 10_Arc_ANME-1_bin.93_prokka.gff
[00:36:24] DEBUG Extracted 2007 features (type=CDS id=ID parent=Parent)

and then: python -m jcvi.compara.catalog ortholog anme_10_93 ref_10_93 --cscore=.99 --no_strip_names --notex and I have this : WARNING k141_764561 not in anme_10_93.bed
WARNING k141_435926 not in anme_10_93.bed
WARNING k141_1553865 not in anme_10_93.bed
... DEBUG A total of 0 anchor was found. Aborted.

I have this error indicating that my .bed file doesn't match with cds

so I check my .bed file and compare with my .gff file and they are like this:

.gff

sequence-region k141_1236778 1 3185

sequence-region k141_395534 1 2305

k141_1435079 Prodigal:002006 CDS 278 2164 . - 0
ID=FANGMIIN_00001;inference=ab initio prediction:Prodigal:002006; ocus_tag=FANGMIIN_00001;product=hypothetical protein

.bed

bed k141_1629 111 417 FANGMIIN_00099 0 - k141_1629 666 924 FANGMIIN_00100 0 + k141_1629 1054 1213 FANGMIIN_00101 0 -

.fasta

fasta k141_1435079 GTGTCTGATTGTTTTGTGATTGCGTCTTCGTAATTTTTAGAGCCTGTTTCGTCCTTCAAC GCGTGTAGTAGTTGGAGTGTTGATTGTTCTATTTTTTTTGTCACCGCCGAACTTGTCTTT

.cds

k141_1435079 GTGTCTGATTGTTTTGTGATTGCGTCTTCGTAATTTTTAGAGCCTGTTTCGTCCTTCAAC

I tried several type=CDS and key in python -m jcvi.formats.gff bed --type=CDS --key=ID 10_Arc_ANME-1_bin.93_prokka.gff -o anme_10_93.bed, but couldnt solve this problem Do you have any suggestions on what I can try to fix this issue? Thank you so much in advance!!!

z626093820 commented 7 months ago

python -m jcvi.formats.gff bed --type=gene --key=ID in.gff -o out.bed 是不是应该将type改为mRNA,gene,transcript,一个基因有好几个CDS,为什么单独提取所有CDS

yyyuechen commented 7 months ago

python -m jcvi.formats.gff bed --type=gene --key=ID in.gff -o out.bed 是不是应该将type改为mRNA,gene,transcript,一个基因有好几个CDS,为什么单独提取所有CDS

因为我的gff文件里只有cds信息 提取gene或者mrna的话就会提示Extracted 0 features。 我发现之前这么多warning是因为fasta的基因名和bed文件第四列不符合,修改之后虽然warning数量减少了 但还是有一个not in bed file,后续得到这样的结果: WARNING k141_1388814 not in anme_10_93.bed blastfilter.py:65 WARNING k141_1388814 not in anme_10_93.bed blastfilter.py:65 WARNING k141_1388814 not in anme_10_93.bed blastfilter.py:65 WARNING k141_1388814 not in anme_10_93.bed blastfilter.py:65 WARNING k141_1388814 not in anme_10_93.bed blastfilter.py:65 WARNING k141_1388814 not in anme_10_93.bed blastfilter.py:65 WARNING k141_1388814 not in anme_10_93.bed blastfilter.py:65 WARNING k141_1388814 not in anme_10_93.bed blastfilter.py:65 WARNING k141_1388814 not in anme_10_93.bed blastfilter.py:65 WARNING k141_1388814 not in anme_10_93.bed blastfilter.py:65 DEBUG running the cscore filter (cscore>=0.99) blastfilter.py:108 ..
DEBUG after filter (4636->101) .. blastfilter.py:110 DEBUG running the local dups filter blastfilter.py:115 (tandem_Nmax=10) ..
DEBUG after filter (101->101) .. blastfilter.py:155 DEBUG Assuming --qbed=anme_10_93.bed synteny.py:390 --sbed=ref_10_93.bed
DEBUG Load file anme_10_93.bed base.py:33 DEBUG Load file ref_10_93.bed base.py:33 DEBUG Load file base.py:33 anme_10_93.ref_10_93.last.filtered
DEBUG A total of 101 BLAST imported from synteny.py:277 anme_10_93.ref_10_93.last.filtered.
DEBUG Chaining distance = 20 synteny.py:1794 DEBUG Load file anme_10_93.ref_10_93.anchors base.py:33 DEBUG A total of 0 anchor was found. Aborted. synteny.py:1371

我在想是不是我的上游文件gff有问题🤨 可我已经有101 BLAST match了

z626093820 commented 7 months ago

python -m jcvi.formats.gff bed --type=gene --key=ID in.gff -o out.bed 是不是应该将type改为mRNA,gene,transcript,一个基因有好几个CDS,为什么单独提取所有CDS

因为我的gff文件里只有cds信息 提取gene或者mrna的话就会提示Extracted 0 features。 我发现之前这么多warning是因为fasta的基因名和bed文件第四列不符合,修改之后虽然warning数量减少了 但还是有一个not in bed file,后续得到这样的结果: WARNING k141_1388814 not in anme_10_93.bed blastfilter.py:65 WARNING k141_1388814 not in anme_10_93.bed blastfilter.py:65 WARNING k141_1388814 not in anme_10_93.bed blastfilter.py:65 WARNING k141_1388814 not in anme_10_93.bed blastfilter.py:65 WARNING k141_1388814 not in anme_10_93.bed blastfilter.py:65 WARNING k141_1388814 not in anme_10_93.bed blastfilter.py:65 WARNING k141_1388814 not in anme_10_93.bed blastfilter.py:65 WARNING k141_1388814 not in anme_10_93.bed blastfilter.py:65 WARNING k141_1388814 not in anme_10_93.bed blastfilter.py:65 WARNING k141_1388814 not in anme_10_93.bed blastfilter.py:65 DEBUG running the cscore filter (cscore>=0.99) blastfilter.py:108 .. DEBUG after filter (4636->101) .. blastfilter.py:110 DEBUG running the local dups filter blastfilter.py:115 (tandem_Nmax=10) .. DEBUG after filter (101->101) .. blastfilter.py:155 DEBUG Assuming --qbed=anme_10_93.bed synteny.py:390 --sbed=ref_10_93.bed DEBUG Load file anme_10_93.bed base.py:33 DEBUG Load file ref_10_93.bed base.py:33 DEBUG Load file base.py:33 anme_10_93.ref_10_93.last.filtered DEBUG A total of 101 BLAST imported from synteny.py:277 anme_10_93.ref_10_93.last.filtered. DEBUG Chaining distance = 20 synteny.py:1794 DEBUG Load file anme_10_93.ref_10_93.anchors base.py:33 DEBUG A total of 0 anchor was found. Aborted. synteny.py:1371

我在想是不是我的上游文件gff有问题🤨 可我已经有101 BLAST match了

你看一下gff文件格式,感觉像gff的问题