velocyto-team / velocyto.py

RNA velocity estimation in Python
http://velocyto.org/velocyto.py/
BSD 2-Clause "Simplified" License
160 stars 83 forks source link

GTF: AttributeError: 'NoneType' object has no attribute 'group' #328

Open bbimber opened 2 years ago

bbimber commented 2 years ago

Hello,

I'm trying to run velocyto using 10x data and GTFs for macaca mulatta from NCBI. The command is:

velocyto run -o ./outs -b barcodes.tsv.gz --samtools-threads 11 -m mm10_rmsk.rename.gtf 438-10-GEX.bam NCBI.Mmul_10.103.translated.gtf

It gives the following error, which I assume is something related to a line in the GTF from MT? Have you encountered problems like this before?

24 Mar 2022 05:50:21,736 DEBUG:     2022-03-24 05:50:21,438 - DEBUG - Assigning indexes to genes
24 Mar 2022 05:50:21,740 DEBUG:     2022-03-24 05:50:21,438 - DEBUG - Seen 32538 genes until now
24 Mar 2022 05:50:21,745 DEBUG:     2022-03-24 05:50:21,438 - DEBUG - Parsing Chromosome MT strand - [line 1978045]
24 Mar 2022 05:50:21,749 DEBUG:     /home/exacloud/gscratch/prime-seq/bin/primeseq-python/lib/python3.6/site-packages/numba/np/ufunc/parallel.py:365: NumbaWarning: The TBB threading layer requires TBB version 2019.5 or later i.e., TBB_INTERFACE_VERSION >= 11005. Found TBB_INTERFACE_VERSION = 6103. The TBB threading layer is disabled.
24 Mar 2022 05:50:21,754 DEBUG:       warnings.warn(problem)
24 Mar 2022 05:50:21,759 DEBUG:     Traceback (most recent call last):
24 Mar 2022 05:50:21,764 DEBUG:       File "/home/exacloud/gscratch/prime-seq/bin/primeseq-python/bin/velocyto", line 8, in <module>
24 Mar 2022 05:50:21,771 DEBUG:         sys.exit(cli())
24 Mar 2022 05:50:21,777 DEBUG:       File "/home/exacloud/gscratch/prime-seq/bin/primeseq-python/lib/python3.6/site-packages/click/core.py", line 829, in __call__
24 Mar 2022 05:50:21,783 DEBUG:         return self.main(*args, **kwargs)
24 Mar 2022 05:50:21,788 DEBUG:       File "/home/exacloud/gscratch/prime-seq/bin/primeseq-python/lib/python3.6/site-packages/click/core.py", line 782, in main
24 Mar 2022 05:50:21,793 DEBUG:         rv = self.invoke(ctx)
24 Mar 2022 05:50:21,797 DEBUG:       File "/home/exacloud/gscratch/prime-seq/bin/primeseq-python/lib/python3.6/site-packages/click/core.py", line 1259, in invoke
24 Mar 2022 05:50:21,802 DEBUG:         return _process_result(sub_ctx.command.invoke(sub_ctx))
24 Mar 2022 05:50:21,809 DEBUG:       File "/home/exacloud/gscratch/prime-seq/bin/primeseq-python/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
24 Mar 2022 05:50:21,815 DEBUG:         return ctx.invoke(self.callback, **ctx.params)
24 Mar 2022 05:50:21,821 DEBUG:       File "/home/exacloud/gscratch/prime-seq/bin/primeseq-python/lib/python3.6/site-packages/click/core.py", line 610, in invoke
24 Mar 2022 05:50:21,826 DEBUG:         return callback(*args, **kwargs)
24 Mar 2022 05:50:21,830 DEBUG:       File "/home/exacloud/gscratch/prime-seq/bin/primeseq-python/lib/python3.6/site-packages/velocyto/commands/run.py", line 116, in run
24 Mar 2022 05:50:21,835 DEBUG:         samtools_memory=samtools_memory, dump=dump, loom_numeric_dtype=dtype, verbose=verbose, additional_ca=additional_ca)
24 Mar 2022 05:50:21,840 DEBUG:       File "/home/exacloud/gscratch/prime-seq/bin/primeseq-python/lib/python3.6/site-packages/velocyto/commands/_run.py", line 186, in _run
24 Mar 2022 05:50:21,845 DEBUG:         annotations_by_chrm_strand = exincounter.read_transcriptmodels(gtffile)
24 Mar 2022 05:50:21,850 DEBUG:       File "/home/exacloud/gscratch/prime-seq/bin/primeseq-python/lib/python3.6/site-packages/velocyto/counter.py", line 514, in read_transcriptmodels
24 Mar 2022 05:50:21,855 DEBUG:         geneid = regex_geneid.search(tags).group(1)
24 Mar 2022 05:50:21,859 DEBUG:     AttributeError: 'NoneType' object has no attribute 'group'
24 Mar 2022 06:00:41,530 DEBUG:     [bam_sort_core] merging from 143 files and 11 in-memory blocks...
24 Mar 2022 06:00:41,549 WARN :     process exited with non-zero value: 1
bbimber commented 2 years ago

I am guessing lines like this break it:

MT  RefSeq  exon    536 607 .   +   .   gene_id ""; transcript_id "unknown_transcript_1"; gbkey "tRNA"; product "tRNA-Phe"; exon_number "1"; 

note the empty string gene_id. This is just a guess at the moment. However, this GTF is verbatim from NCBI, so it might be nice if velocyto could work with (perhaps just ingore) these sort of lines.

topafzd commented 2 years ago

Hi, I also stuck in this place. error information as: Traceback (most recent call last): File "/ism/zhangyanan/.conda/envs/velocyto/bin/velocyto", line 8, in sys.exit(cli()) File "/ism/zhangyanan/.conda/envs/velocyto/lib/python3.7/site-packages/click/core.py", line 1130, in call return self.main(args, kwargs) File "/ism/zhangyanan/.conda/envs/velocyto/lib/python3.7/site-packages/click/core.py", line 1055, in main rv = self.invoke(ctx) File "/ism/zhangyanan/.conda/envs/velocyto/lib/python3.7/site-packages/click/core.py", line 1657, in invoke return _process_result(sub_ctx.command.invoke(sub_ctx)) File "/ism/zhangyanan/.conda/envs/velocyto/lib/python3.7/site-packages/click/core.py", line 1404, in invoke return ctx.invoke(self.callback, ctx.params) File "/ism/zhangyanan/.conda/envs/velocyto/lib/python3.7/site-packages/click/core.py", line 760, in invoke return __callback(args, **kwargs) File "/ism/zhangyanan/.conda/envs/velocyto/lib/python3.7/site-packages/velocyto/commands/run.py", line 116, in run samtools_memory=samtools_memory, dump=dump, loom_numeric_dtype=dtype, verbose=verbose, additional_ca=additional_ca) File "/ism/zhangyanan/.conda/envs/velocyto/lib/python3.7/site-packages/velocyto/commands/_run.py", line 186, in _run annotations_by_chrm_strand = exincounter.read_transcriptmodels(gtffile) File "/ism/zhangyanan/.conda/envs/velocyto/lib/python3.7/site-packages/velocyto/counter.py", line 514, in read_transcriptmodels geneid = regex_geneid.search(tags).group(1) AttributeError: 'NoneType' object has no attribute 'group'

And I checked the log, it stoped when velocyto begin to parse chromosomes: 2022-04-02 20:50:10,198 - INFO - No SAMPLEID specified, the sample will be called sort_V2R2109240074_anno_decon_change_CB_LOJQD (last 5 digits are a random-id to avoid overwriting some other file by mistake) 2022-04-02 20:50:10,199 - DEBUG - Using logic: Default 2022-04-02 20:50:10,250 - INFO - Read 15986 cell barcodes from /gluster/home/chendongsheng/zhangyanan/velocyto/test_09240074/09240074_barcodes.tsv 2022-04-02 20:50:10,250 - DEBUG - Example of barcode: CELL12028_N1 and cell_id: sort_V2R2109240074_anno_decon_change_CB_LOJQD:CELL12028_N1 2022-04-02 20:50:10,412 - DEBUG - Peeking into /gluster/home/chendongsheng/zhangyanan/velocyto/test_09240074/sort_V2R2109240074_anno_decon_change_CB.bam 2022-04-02 20:50:10,585 - INFO - Starting the sorting process of /gluster/home/chendongsheng/zhangyanan/velocyto/test_09240074/sort_V2R2109240074_anno_decon_change_CB.bam the output will be at: /gluster/home/chendongsheng/zhangyanan/velocyto/test_09240074/cellsorted_sort_V2R2109240074_anno_decon_change_CB.bam 2022-04-02 20:50:10,586 - INFO - Command being run is: samtools sort -l 7 -m 2048M -t CB -O BAM -@ 24 -o /gluster/home/chendongsheng/zhangyanan/velocyto/test_09240074/cellsorted_sort_V2R2109240074_anno_decon_change_CB.bam /gluster/home/chendongsheng/zhangyanan/velocyto/test_09240074/sort_V2R2109240074_anno_decon_change_CB.bam 2022-04-02 20:50:10,586 - INFO - While the bam sorting happens do other things... 2022-04-02 20:50:10,586 - INFO - Load the annotation from /gluster/home/chendongsheng/zhangyanan/velocyto/GCF_000002285.5_Dog10K_Boxer_Tasha_genomic.gtf 2022-04-02 20:50:20,142 - DEBUG - Parsing Chromosome NC_002008.4 strand - [line 0]

####### And I also checked the gif file, all gene_id are not empty. Do you have any idea about what's going on? Thanks a lot!

topafzd commented 2 years ago

For more information, my gtf looks like this:

gtf-version 2.2

!genome-build Dog10K_Boxer_Tasha

!genome-build-accession NCBI_Assembly:GCF_000002285.5

!annotation-source NCBI Canis lupus familiaris Annotation Release 106

NC_006583.4 Gnomon gene 29337 30631 . + . gene_id "LOC100684726"; transcript_id ""; db_xref "GeneID:100684726"; gbkey "Gene"; gene "LOC100684726"; gene_biotype "pseudogene"; pseudo "true"; NC_006583.4 Gnomon gene 651385 789323 . + . gene_id "LOC119870367"; transcript_id ""; db_xref "GeneID:119870367"; gbkey "Gene"; gene "LOC119870367"; gene_biotype "lncRNA"; NC_006583.4 Gnomon transcript 651385 789323 . + . gene_id "LOC119870367"; transcript_id "XR_005316961.1"; db_xref "GeneID:119870367"; gbkey "ncRNA"; gene "LOC119870367"; model_evidence "Supporting evidence includes similarity to: 1 EST, and 99% coverage of the annotated genomic feature by RNAseq alignments, including 10 samples with support for all annotated introns"; product "uncharacterized LOC119870367"; transcript_biotype "lnc_RNA"; NC_006583.4 Gnomon exon 651385 652415 . + . gene_id "LOC119870367"; transcript_id "XR_005316961.1"; db_xref "GeneID:119870367"; gene "LOC119870367"; model_evidence "Supporting evidence includes similarity to: 1 EST, and 99% coverage of the annotated genomic feature by RNAseq alignments, including 10 samples with support for all annotated introns"; product "uncharacterized LOC119870367"; transcript_biotype "lnc_RNA"; exon_number "1"; NC_006583.4 Gnomon exon 695111 695247 . + . gene_id "LOC119870367"; transcript_id "XR_005316961.1"; db_xref "GeneID:119870367"; gene "LOC119870367"; model_evidence "Supporting evidence includes similarity to: 1 EST, and 99% coverage of the annotated genomic feature by RNAseq alignments, including 10 samples with support for all annotated introns"; product "uncharacterized LOC119870367"; transcript_biotype "lnc_RNA"; exon_number "2"; NC_006583.4 Gnomon exon 696316 696387 . + . gene_id "LOC119870367"; transcript_id "XR_005316961.1"; db_xref "GeneID:119870367"; gene "LOC119870367"; model_evidence "Supporting evidence includes similarity to: 1 EST, and 99% coverage of the annotated genomic feature by RNAseq alignments, including 10 samples with support for all annotated introns"; product "uncharacterized LOC119870367"; transcript_biotype "lnc_RNA"; exon_number "3";

topafzd commented 2 years ago

Hi, I have solved this problem. Maybe you want to check the chromosome names in your bam and gtf. I looked through the code of velocyto and found that velocyto can only recognize chromosome name like 1... or chr1... So I changed my chromosome name in bam and gtf from NC_006583.4... to chr1... and it works. Hope this can help you.

HuangDDU commented 2 years ago

Hi, I have solved this problem. Maybe you want to check the chromosome names in your bam and gtf. I looked through the code of velocyto and found that velocyto can only recognize chromosome name like 1... or chr1... So I changed my chromosome name in bam and gtf from NC_006583.4... to chr1... and it works. Hope this can help you.

Maybe that's right! But for me,the first column of gtf file downloaded from NNCBI has too many value,I don't know how to map them to stand format like chr1,chr2...。

There is the first colmn of my gtf file。 `(base) [21031211625@login01_d_02 resource]$ cut -f1 GCF_000001635.26_GRCm38.p6_genomic.gtf | sort | uniq

!annotation-source NCBI Mus musculus Annotation Release 108

!genome-build-accession NCBI_Assembly:GCF_000001635.26

!genome-build GRCm38.p6

gtf-version 2.2

NC_000067.6 NC_000068.7 NC_000069.6 NC_000070.6 NC_000071.6 NC_000072.6 NC_000073.6 NC_000074.6 NC_000075.6 NC_000076.6 NC_000077.6 NC_000078.6 NC_000079.6 NC_000080.6 NC_000081.6 NC_000082.6 NC_000083.6 NC_000084.6 NC_000085.6 NC_000086.7 NC_000087.7 NC_005089.1 NT_039192.2 NT_039194.5 NT_039195.1 NT_039198.1 NT_039223.3 NT_039247.2 NT_039248.1 NT_039252.1 NT_039256.1 NT_039257.4 NT_039291.1 NT_039296.1 NT_039334.1 NT_039367.1 NT_039372.2 NT_039373.1 NT_039382.4 NT_039443.1 NT_039469.1 NT_039470.1 NT_039525.1 NT_039593.1 NT_039594.3 NT_039614.1 NT_039622.1 NT_039630.5 NT_039634.4 NT_039661.4 NT_039662.3 NT_039666.2 NT_039697.1 NT_039739.2 NT_039744.2 NT_039747.4 NT_078572.1 NT_078683.1 NT_080256.1 NT_080257.1 NT_095534.1 NT_096355.1 NT_096797.1 NT_097336.1 NT_099024.1 NT_109318.2 NT_109599.1 NT_111920.1 NT_114257.4 NT_114985.3 NT_114988.1 NT_162750.1 NT_165789.2 NT_166280.1 NT_166281.1 NT_166282.1 NT_166283.1 NT_166287.2 NT_166291.1 NT_166293.1 NT_166294.1 NT_166295.1 NT_166299.2 NT_166305.2 NT_166307.1 NT_166317.2 NT_166322.1 NT_166394.1 NT_166396.1 NT_166397.1 NT_166399.1 NT_166400.1 NT_166433.1 NT_166434.1 NT_166438.1 NT_166443.1 NT_166450.1 NT_166456.1 NT_166462.1 NT_166463.1 NT_166466.1 NT_166469.1 NT_186999.1 NT_187000.1 NT_187001.1 NT_187002.1 NT_187003.1 NT_187004.1 NT_187005.1 NT_187006.1 NT_187007.1 NT_187008.1 NT_187009.1 NT_187010.1 NT_187011.1 NT_187012.1 NT_187013.1 NT_187014.1 NT_187015.1 NT_187016.1 NT_187017.1 NT_187018.1 NT_187019.1 NT_187020.1 NT_187021.1 NT_187022.1 NT_187023.1 NT_187024.1 NT_187025.1 NT_187026.1 NT_187027.1 NT_187028.1 NT_187029.1 NT_187030.1 NT_187052.1 NT_187053.1 NT_187054.1 NT_187055.1 NT_187056.1 NT_187057.1 NT_187058.1 NT_187059.1 NT_187060.1 NT_187064.1 NW_004058050.1 NW_004058051.1 NW_004058052.1 NW_004058053.1 NW_004058054.1 NW_004058055.2 NW_004058056.1 NW_004058058.1 NW_004450259.3 NW_004450260.1 NW_004450261.1 NW_004450262.2 NW_004450263.1 NW_006763128.1 NW_006763129.1 NW_006763130.1 NW_012132900.1 NW_012132901.1 NW_012132902.1 NW_012132903.1 NW_012132904.1 NW_012132905.1 NW_012132906.1 NW_012132907.1 NW_012132908.1 NW_012132909.2 NW_012132910.1 NW_012132911.1 NW_012132912.1 NW_012132913.1 NW_016097312.1 NW_016097313.1 NW_016097314.1 NW_016097315.1 NW_016097316.1 NW_016097317.1 NW_016097318.1 NW_016097319.1 NW_016097320.1 NW_016097321.1 NW_016097322.1 NW_019168503.1 NW_019168504.1 NW_019168505.1 NW_019168506.1 NW_019168507.1 NW_019168508.1 NW_019168509.1 NW_019168510.1 NW_019168511.1 NW_019168512.1 NW_019168513.1 NW_019168514.1 NW_019168515.1 NW_019168516.1 NW_019168517.1 NW_019168518.1 NW_019168519.1 NW_019168520.1 NW_019168521.1 NW_019168522.1 NW_019168523.1 NW_019168524.1 NW_019168525.1 NW_019168526.1 NW_019168527.1 NW_019168528.1 NW_019168529.1 NW_019168530.1 NW_019168531.1 NW_019168532.1 NW_019168533.1 NW_019168534.1`

kokitsuyuzaki commented 2 years ago

Same here. In my case, I performed 10X CellRanger against scRNA-seq data of Hemicentrotus pulcherrimus, which is a kind of sea urchin, and then tried to apply velocyto.

I used two GTF files convered by gffread from the GFF3 files of Hpbase (https://cell-innovation.nig.ac.jp/cgi-bin/Hpul_public/Hpul_annot_download.cgi) and Echinobase (https://www.echinobase.org/entry/static-echinobase/ftpDatafiles.jsp).

The error message was about the line below https://github.com/velocyto-team/velocyto.py/blob/0963dd2df0ac802c36404e0f434ba97f07edfe4b/velocyto/counter.py#L514

and I finally found out that after deleting the lines where gene_id are not written, velocyto worked.

The error message was like below:

(velocyto_env) velocyto run10x $SAMPLE $GTF
2022-10-05 14:44:49,023 - DEBUG - Using logic: Default
2022-10-05 14:44:49,037 - INFO - Read 5633 cell barcodes from /data2/koki/urchin-workflow2/output/echinobase/SeaUrchin-scRNA-01/outs/filtered_feature_bc_matrix/barcodes.tsv.gz
2022-10-05 14:44:49,037 - DEBUG - Example of barcode: AAACCCACACAAATGA and cell_id: SeaUrchin-scRNA-01:AAACCCACACAAATGA-1
2022-10-05 14:44:49,047 - DEBUG - Peeking into /data2/koki/urchin-workflow2/output/echinobase/SeaUrchin-scRNA-01/outs/possorted_genome_bam.bam
2022-10-05 14:44:49,055 - WARNING - Not found cell and umi barcode in entry 34 of the bam file
2022-10-05 14:44:49,055 - WARNING - Not found cell and umi barcode in entry 35 of the bam file
2022-10-05 14:44:49,055 - WARNING - Not found cell and umi barcode in entry 121 of the bam file
2022-10-05 14:44:49,056 - WARNING - Not found cell and umi barcode in entry 159 of the bam file
2022-10-05 14:44:49,056 - WARNING - Not found cell and umi barcode in entry 174 of the bam file
2022-10-05 14:44:49,056 - WARNING - Not found cell and umi barcode in entry 189 of the bam file
2022-10-05 14:44:49,056 - WARNING - Not found cell and umi barcode in entry 214 of the bam file
2022-10-05 14:44:49,056 - WARNING - Not found cell and umi barcode in entry 234 of the bam file
2022-10-05 14:44:49,056 - WARNING - Not found cell and umi barcode in entry 238 of the bam file
2022-10-05 14:44:49,056 - WARNING - Not found cell and umi barcode in entry 258 of the bam file
2022-10-05 14:44:49,057 - WARNING - Not found cell and umi barcode in entry 282 of the bam file
2022-10-05 14:44:49,057 - WARNING - Not found cell and umi barcode in entry 307 of the bam file
2022-10-05 14:44:49,057 - WARNING - Not found cell and umi barcode in entry 347 of the bam file
2022-10-05 14:44:49,057 - WARNING - Not found cell and umi barcode in entry 368 of the bam file
2022-10-05 14:44:49,057 - WARNING - Not found cell and umi barcode in entry 401 of the bam file
2022-10-05 14:44:49,058 - WARNING - Not found cell and umi barcode in entry 403 of the bam file
2022-10-05 14:44:49,058 - WARNING - Not found cell and umi barcode in entry 404 of the bam file
2022-10-05 14:44:49,058 - WARNING - Not found cell and umi barcode in entry 406 of the bam file
2022-10-05 14:44:49,058 - WARNING - Not found cell and umi barcode in entry 407 of the bam file
2022-10-05 14:44:49,058 - WARNING - Not found cell and umi barcode in entry 408 of the bam file
2022-10-05 14:44:49,058 - WARNING - Not found cell and umi barcode in entry 410 of the bam file
2022-10-05 14:44:49,058 - WARNING - Not found cell and umi barcode in entry 411 of the bam file
2022-10-05 14:44:49,058 - WARNING - Not found cell and umi barcode in entry 414 of the bam file
2022-10-05 14:44:49,058 - WARNING - Not found cell and umi barcode in entry 415 of the bam file
2022-10-05 14:44:49,059 - WARNING - Not found cell and umi barcode in entry 463 of the bam file
2022-10-05 14:44:49,059 - WARNING - Not found cell and umi barcode in entry 499 of the bam file
2022-10-05 14:44:49,059 - WARNING - Not found cell and umi barcode in entry 562 of the bam file
2022-10-05 14:44:49,059 - WARNING - Not found cell and umi barcode in entry 605 of the bam file
2022-10-05 14:44:49,060 - WARNING - Not found cell and umi barcode in entry 639 of the bam file
2022-10-05 14:44:49,060 - WARNING - Not found cell and umi barcode in entry 646 of the bam file
2022-10-05 14:44:49,063 - WARNING - The file /data2/koki/urchin-workflow2/output/echinobase/SeaUrchin-scRNA-01/outs/cellsorted_possorted_genome_bam.bam already exists. The sorting step will be skipped and the existing file will be used.
2022-10-05 14:44:49,063 - INFO - Load the annotation from /data2/koki/urchin-workflow2/data/echinobase/sp5_0_GCF.gtf
2022-10-05 14:44:49,399 - WARNING - The entry exon_number was not present in the gtf file. It will be infferred from the position.
2022-10-05 14:44:51,801 - DEBUG - Parsing Chromosome NC_001453.1 strand - [line 0]
Traceback (most recent call last):
  File "/home/koki/miniconda3/envs/velocyto_env/bin/velocyto", line 8, in <module>
    sys.exit(cli())
  File "/home/koki/miniconda3/envs/velocyto_env/lib/python3.8/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/home/koki/miniconda3/envs/velocyto_env/lib/python3.8/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/home/koki/miniconda3/envs/velocyto_env/lib/python3.8/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/koki/miniconda3/envs/velocyto_env/lib/python3.8/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/koki/miniconda3/envs/velocyto_env/lib/python3.8/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/home/koki/miniconda3/envs/velocyto_env/lib/python3.8/site-packages/velocyto/commands/run10x.py", line 112, in run10x
    return _run(bamfile=(bamfile, ), gtffile=gtffile, bcfile=bcfile, outputfolder=outputfolder,
  File "/home/koki/miniconda3/envs/velocyto_env/lib/python3.8/site-packages/velocyto/commands/_run.py", line 186, in _run
    annotations_by_chrm_strand = exincounter.read_transcriptmodels(gtffile)
  File "/home/koki/miniconda3/envs/velocyto_env/lib/python3.8/site-packages/velocyto/counter.py", line 514, in read_transcriptmodels
    geneid = regex_geneid.search(tags).group(1)
AttributeError: 'NoneType' object has no attribute 'group'