velocyto-team / velocyto.py

RNA velocity estimation in Python
http://velocyto.org/velocyto.py/
BSD 2-Clause "Simplified" License

UMI Barcode Entry and Samtools error on 10X Cell Ranger GTF and BAM files #311

Closed AAA-3 closed 3 years ago

AAA-3 commented 3 years ago

Hello all! I think I have issues with the BAM and GTF files but do not know how to resolve them. I only started working with Python this fortnight so any help would be useful!

I have scRNA-seq data that was processed with the 10X Genomics Cell Ranger pipeline. The resulting BAM file and the provided GTF file for the mm10 genome (10X reference version 2.1.0, GRCm38, Ensembl 84) do not seem to be compatible with velocyto.

After running the following command (... is used here to shorten the file paths for readability):

velocyto run -b /.../barcodes.tsv -o /.../b2het_out -m .../mm10_repeat_mask.gtf .../possorted_genome_bam.bam .../genes.gtf

I get this error saying that the UMI barcode entry is missing from the BAM file and that I need to run Samtools on the file to sort it. According to the tutorial on the velocyto website, this should not be necessary, since Cell Ranger already does the sorting.

Any idea how to resolve this issue? Or how to format my GTF files if needed?

2021-07-28 16:37:44,746 - INFO - No SAMPLEID specified, the sample will be called possorted_genome_bam_S19RV (last 5 digits are a random-id to avoid overwriting some other file by mistake)
2021-07-28 16:37:44,746 - DEBUG - Using logic: Default
2021-07-28 16:37:44,749 - INFO - Read 3281 cell barcodes from /home/ali/Dokumente/RPractise/Run_alle_features/Velocity/PythonCodes/B2_HET/barcodes.tsv
2021-07-28 16:37:44,749 - DEBUG - Example of barcode: AAACCTGAGGCTATCT and cell_id: possorted_genome_bam_S19RV:AAACCTGAGGCTATCT-1
2021-07-28 16:37:44,774 - DEBUG - Peeking into /home/ali/Dokumente/RPractise/Run_alle_features/Velocity/PythonCodes/B2_HET/possorted_genome_bam.bam
[E::idx_find_and_load] Could not retrieve index file for '/home/ali/Dokumente/RPractise/Run_alle_features/Velocity/PythonCodes/B2_HET/possorted_genome_bam.bam'
2021-07-28 16:37:44,776 - WARNING - Not found cell and umi barcode in entry 12 of the bam file
2021-07-28 16:37:44,776 - WARNING - Not found cell and umi barcode in entry 19 of the bam file
2021-07-28 16:37:44,776 - WARNING - Not found cell and umi barcode in entry 23 of the bam file
2021-07-28 16:37:44,777 - WARNING - Not found cell and umi barcode in entry 25 of the bam file
2021-07-28 16:37:44,777 - WARNING - Not found cell and umi barcode in entry 82 of the bam file
2021-07-28 16:37:44,777 - WARNING - Not found cell and umi barcode in entry 137 of the bam file
2021-07-28 16:37:44,777 - WARNING - Not found cell and umi barcode in entry 138 of the bam file
2021-07-28 16:37:44,778 - WARNING - Not found cell and umi barcode in entry 218 of the bam file
2021-07-28 16:37:44,778 - WARNING - Not found cell and umi barcode in entry 280 of the bam file
2021-07-28 16:37:44,778 - WARNING - Not found cell and umi barcode in entry 281 of the bam file
2021-07-28 16:37:44,778 - WARNING - Not found cell and umi barcode in entry 282 of the bam file
2021-07-28 16:37:44,778 - WARNING - Not found cell and umi barcode in entry 283 of the bam file
2021-07-28 16:37:44,779 - WARNING - Not found cell and umi barcode in entry 349 of the bam file
2021-07-28 16:37:44,780 - WARNING - Not found cell and umi barcode in entry 558 of the bam file
2021-07-28 16:37:44,780 - WARNING - Not found cell and umi barcode in entry 564 of the bam file
2021-07-28 16:37:44,781 - WARNING - Not found cell and umi barcode in entry 649 of the bam file
2021-07-28 16:37:44,781 - WARNING - Not found cell and umi barcode in entry 654 of the bam file
2021-07-28 16:37:44,781 - WARNING - Not found cell and umi barcode in entry 696 of the bam file
2021-07-28 16:37:44,781 - WARNING - Not found cell and umi barcode in entry 697 of the bam file
2021-07-28 16:37:44,782 - WARNING - Not found cell and umi barcode in entry 796 of the bam file
2021-07-28 16:37:44,782 - WARNING - Not found cell and umi barcode in entry 818 of the bam file
2021-07-28 16:37:44,782 - WARNING - Not found cell and umi barcode in entry 819 of the bam file
2021-07-28 16:37:44,782 - WARNING - Not found cell and umi barcode in entry 821 of the bam file
2021-07-28 16:37:44,782 - WARNING - Not found cell and umi barcode in entry 905 of the bam file
2021-07-28 16:37:44,782 - WARNING - Not found cell and umi barcode in entry 906 of the bam file
2021-07-28 16:37:44,782 - WARNING - Not found cell and umi barcode in entry 907 of the bam file
2021-07-28 16:37:44,782 - WARNING - Not found cell and umi barcode in entry 909 of the bam file
Traceback (most recent call last):
  File "/home/ali/anaconda3/bin/velocyto", line 8, in <module>
    sys.exit(cli())
  File "/home/ali/anaconda3/lib/python3.8/site-packages/click/core.py", line 1137, in __call__
    return self.main(*args, **kwargs)
  File "/home/ali/anaconda3/lib/python3.8/site-packages/click/core.py", line 1062, in main
    rv = self.invoke(ctx)
  File "/home/ali/anaconda3/lib/python3.8/site-packages/click/core.py", line 1668, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/ali/anaconda3/lib/python3.8/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/ali/anaconda3/lib/python3.8/site-packages/click/core.py", line 763, in invoke
    return __callback(*args, **kwargs)
  File "/home/ali/anaconda3/lib/python3.8/site-packages/velocyto/commands/run.py", line 113, in run
    return _run(bamfile=bamfile, gtffile=gtffile, bcfile=bcfile, outputfolder=outputfolder,
  File "/home/ali/anaconda3/lib/python3.8/site-packages/velocyto/commands/_run.py", line 178, in _run
    sorting_process[ni] = subprocess.Popen(command.split(), stdout=subprocess.PIPE)
  File "/home/ali/anaconda3/lib/python3.8/subprocess.py", line 858, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/home/ali/anaconda3/lib/python3.8/subprocess.py", line 1704, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'samtools'

AAA-3 commented 3 years ago

Would the old version of Cell Ranger have anything to do with this?

zhenzuo2 commented 3 years ago

[E::idx_find_and_load] Could not retrieve index file for '/home/ali/Dokumente/RPractise/Run_alle_features/Velocity/PythonCodes/B2_HET/possorted_genome_bam.bam'

So you need to index your BAM file first to get possorted_genome_bam.bam.bai.
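
As a minimal sketch (not part of velocyto), the check below looks for an index file next to a BAM file. It assumes the two common naming conventions, `<name>.bam.bai` and `<name>.bai`; the helper name `find_bam_index` is hypothetical.

```python
from pathlib import Path
from typing import Optional


def find_bam_index(bam_path: str) -> Optional[Path]:
    """Return the path of an existing .bai index for bam_path, or None."""
    bam = Path(bam_path)
    candidates = [
        bam.with_name(bam.name + ".bai"),  # e.g. possorted_genome_bam.bam.bai
        bam.with_suffix(".bai"),           # e.g. possorted_genome_bam.bai
    ]
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    return None
```

If the function returns None for a coordinate-sorted BAM, running `samtools index possorted_genome_bam.bam` should create the index.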

AAA-3 commented 3 years ago

Hello Zhenzuo,

I do already have that file in the same folder as everything else, but it is good to know what that error means. I've noticed that this error line comes and goes, so I think the problem might be something else. E.g., I ran the same command last week and saved the error output, which still mentioned the missing cell and UMI barcodes and the missing Samtools, even though Cell Ranger output shouldn't need it:

2021-07-23 10:43:03,994 - INFO - No SAMPLEID specified, the sample will be called possorted_genome_bam_Y8J8Q (last 5 digits are a random-id to avoid overwriting some other file by mistake)
2021-07-23 10:43:03,994 - DEBUG - Using logic: Default
2021-07-23 10:43:03,997 - INFO - Read 3281 cell barcodes from /home/ali/Dokumente/RPractise/Run_alle_features/Velocity/PythonCodes/B2_HET/barcodes.tsv
2021-07-23 10:43:03,997 - DEBUG - Example of barcode: AAACCTGAGGCTATCT and cell_id: possorted_genome_bam_Y8J8Q:AAACCTGAGGCTATCT-1
2021-07-23 10:43:04,024 - DEBUG - Peeking into /home/ali/Dokumente/RPractise/Run_alle_features/Velocity/PythonCodes/B2_HET/possorted_genome_bam.bam
2021-07-23 10:43:04,048 - WARNING - Not found cell and umi barcode in entry 12 of the bam file
2021-07-23 10:43:04,048 - WARNING - Not found cell and umi barcode in entry 19 of the bam file
2021-07-23 10:43:04,048 - WARNING - Not found cell and umi barcode in entry 23 of the bam file
2021-07-23 10:43:04,048 - WARNING - Not found cell and umi barcode in entry 25 of the bam file
2021-07-23 10:43:04,048 - WARNING - Not found cell and umi barcode in entry 82 of the bam file
2021-07-23 10:43:04,048 - WARNING - Not found cell and umi barcode in entry 137 of the bam file
2021-07-23 10:43:04,048 - WARNING - Not found cell and umi barcode in entry 138 of the bam file
2021-07-23 10:43:04,049 - WARNING - Not found cell and umi barcode in entry 218 of the bam file
2021-07-23 10:43:04,049 - WARNING - Not found cell and umi barcode in entry 280 of the bam file
2021-07-23 10:43:04,049 - WARNING - Not found cell and umi barcode in entry 281 of the bam file
2021-07-23 10:43:04,049 - WARNING - Not found cell and umi barcode in entry 282 of the bam file
2021-07-23 10:43:04,049 - WARNING - Not found cell and umi barcode in entry 283 of the bam file
2021-07-23 10:43:04,049 - WARNING - Not found cell and umi barcode in entry 349 of the bam file
2021-07-23 10:43:04,050 - WARNING - Not found cell and umi barcode in entry 558 of the bam file
2021-07-23 10:43:04,050 - WARNING - Not found cell and umi barcode in entry 564 of the bam file
2021-07-23 10:43:04,050 - WARNING - Not found cell and umi barcode in entry 649 of the bam file
2021-07-23 10:43:04,051 - WARNING - Not found cell and umi barcode in entry 654 of the bam file
2021-07-23 10:43:04,051 - WARNING - Not found cell and umi barcode in entry 696 of the bam file
2021-07-23 10:43:04,051 - WARNING - Not found cell and umi barcode in entry 697 of the bam file
2021-07-23 10:43:04,051 - WARNING - Not found cell and umi barcode in entry 796 of the bam file
2021-07-23 10:43:04,051 - WARNING - Not found cell and umi barcode in entry 818 of the bam file
2021-07-23 10:43:04,051 - WARNING - Not found cell and umi barcode in entry 819 of the bam file
2021-07-23 10:43:04,051 - WARNING - Not found cell and umi barcode in entry 821 of the bam file
2021-07-23 10:43:04,051 - WARNING - Not found cell and umi barcode in entry 905 of the bam file
2021-07-23 10:43:04,052 - WARNING - Not found cell and umi barcode in entry 906 of the bam file
2021-07-23 10:43:04,052 - WARNING - Not found cell and umi barcode in entry 907 of the bam file
2021-07-23 10:43:04,052 - WARNING - Not found cell and umi barcode in entry 909 of the bam file
Traceback (most recent call last):
  File "/home/ali/anaconda3/bin/velocyto", line 8, in <module>
    sys.exit(cli())
  File "/home/ali/anaconda3/lib/python3.8/site-packages/click/core.py", line 1137, in __call__
    return self.main(*args, **kwargs)
  File "/home/ali/anaconda3/lib/python3.8/site-packages/click/core.py", line 1062, in main
    rv = self.invoke(ctx)
  File "/home/ali/anaconda3/lib/python3.8/site-packages/click/core.py", line 1668, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/ali/anaconda3/lib/python3.8/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/ali/anaconda3/lib/python3.8/site-packages/click/core.py", line 763, in invoke
    return __callback(*args, **kwargs)
  File "/home/ali/anaconda3/lib/python3.8/site-packages/velocyto/commands/run.py", line 113, in run
    return _run(bamfile=bamfile, gtffile=gtffile, bcfile=bcfile, outputfolder=outputfolder,
  File "/home/ali/anaconda3/lib/python3.8/site-packages/velocyto/commands/_run.py", line 178, in _run
    sorting_process[ni] = subprocess.Popen(command.split(), stdout=subprocess.PIPE)
  File "/home/ali/anaconda3/lib/python3.8/subprocess.py", line 858, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/home/ali/anaconda3/lib/python3.8/subprocess.py", line 1704, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'samtools'

Also, I'm not sure how to interpret the tracebacks...

AAA-3 commented 3 years ago

I also get the same error when I use velocyto run10x:

2021-07-29 10:28:56,373 - DEBUG - Example of barcode: AAACCTGAGGCTATCT and cell_id: Wittmann_2_Hetero:AAACCTGAGGCTATCT-1
2021-07-29 10:28:56,400 - DEBUG - Peeking into /home/ali/Dokumente/RPractise/E14.5_Auswertung/Rohdaten/SCS_Wittmann/Batch_2/Wittmann_2_Hetero/outs/possorted_genome_bam.bam
2021-07-29 10:28:56,422 - WARNING - Not found cell and umi barcode in entry 12 of the bam file

...

Traceback (most recent call last):
  File "/home/ali/anaconda3/bin/velocyto", line 8, in <module>
    sys.exit(cli())
  File "/home/ali/anaconda3/lib/python3.8/site-packages/click/core.py", line 1137, in __call__
    return self.main(*args, **kwargs)
  File "/home/ali/anaconda3/lib/python3.8/site-packages/click/core.py", line 1062, in main
    rv = self.invoke(ctx)
  File "/home/ali/anaconda3/lib/python3.8/site-packages/click/core.py", line 1668, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/ali/anaconda3/lib/python3.8/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/ali/anaconda3/lib/python3.8/site-packages/click/core.py", line 763, in invoke
    return __callback(*args, **kwargs)
  File "/home/ali/anaconda3/lib/python3.8/site-packages/velocyto/commands/run10x.py", line 112, in run10x
    return _run(bamfile=(bamfile, ), gtffile=gtffile, bcfile=bcfile, outputfolder=outputfolder,
  File "/home/ali/anaconda3/lib/python3.8/site-packages/velocyto/commands/_run.py", line 178, in _run
    sorting_process[ni] = subprocess.Popen(command.split(), stdout=subprocess.PIPE)
  File "/home/ali/anaconda3/lib/python3.8/subprocess.py", line 858, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/home/ali/anaconda3/lib/python3.8/subprocess.py", line 1704, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'samtools'

zhenzuo2 commented 3 years ago

Have you tried Python 3.6? I am not sure whether velocyto supports Python 3.8.

davidhbrann commented 3 years ago

You also need to have samtools installed or on your path: http://www.htslib.org/.

That's why you're getting the following error:

FileNotFoundError: [Errno 2] No such file or directory: 'samtools'
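
A quick way to confirm whether samtools is visible to Python before launching velocyto is sketched below. It only uses the standard library; the helper name `find_executable` is an assumption for illustration and is not part of velocyto.

```python
import shutil
from typing import Optional


def find_executable(name: str = "samtools") -> Optional[str]:
    """Return the full path of `name` if it is on PATH, else None."""
    return shutil.which(name)


if __name__ == "__main__":
    path = find_executable("samtools")
    if path is None:
        print("samtools is not on PATH; the subprocess call would raise FileNotFoundError")
    else:
        print(f"samtools found at {path}")
```

Run it in the same environment (for example, the same conda environment) that is used to run velocyto, since PATH can differ between shells.
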

AAA-3 commented 3 years ago

Quote from davidhbrann: "You also need to have samtools installed or on your path: http://www.htslib.org/. That's why you're getting the following error: FileNotFoundError: [Errno 2] No such file or directory: 'samtools'"

I thought Samtools is only for sorting the BAM file. My BAM file comes from Cell Ranger which I think does not need to be sorted.

Quote from the tutorial: The input bam file needs to be sorted by position, this can be achieved running samtools sort mybam.bam -o sorted_bam.bam. In cellranger generated bamfiles are already sorted this way.

davidhbrann commented 3 years ago

You're reading the part of the tutorial for smartseq2. Yes, the cellranger input bam possorted_genome_bam.bam is already sorted by position, but as part of run10x it uses samtools to sort it by cell id.

As you can see in your traceback, your code is erroring on the following lines on the subprocess call to samtools: https://github.com/velocyto-team/velocyto.py/blob/0963dd2df0ac802c36404e0f434ba97f07edfe4b/velocyto/commands/_run.py#L170-L182
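
The traceback points at a `subprocess.Popen` call that launches samtools. A rough sketch of that kind of call is given here; the exact flags velocyto passes may differ, and the function name, default values and output naming are assumptions for illustration. `samtools sort -t CB` orders reads by the value of the CB (cell barcode) tag.

```python
from pathlib import Path
from typing import List


def cell_sort_command(bam_path: str, threads: int = 4, memory_mb: int = 2048) -> List[str]:
    """Build a samtools command that sorts a BAM file by the CB (cell barcode) tag."""
    bam = Path(bam_path)
    # Output naming is an assumption, modelled on the file name seen in this thread.
    out = bam.with_name("cellsorted_" + bam.name)
    return [
        "samtools", "sort",
        "-t", "CB",               # sort by the value of the CB tag
        "-O", "BAM",              # write BAM output
        "-@", str(threads),       # number of additional threads
        "-m", f"{memory_mb}M",    # memory per thread
        "-o", str(out),
        str(bam),
    ]
```

Passing the result to `subprocess.run(cell_sort_command("possorted_genome_bam.bam"), check=True)` raises the same `FileNotFoundError` when samtools is not on PATH. Building the command as a list rather than splitting a string keeps paths that contain spaces intact.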

AAA-3 commented 3 years ago

Aaaaaah, this makes everything clearer for me. THANKS! The quote came from the run section of the tutorial. I did notice that the run10x signature lists some Samtools options, but it didn't register that I still needed Samtools even though my BAM file was sorted by position (I didn't realise it also needed to be sorted by cell ID). I initially had Samtools installed but removed it because of this, and because I didn't want Samtools "over-correcting" my data... does that make sense?

This time round the code seemed to run fine, but I am still getting the following warnings/errors:

  1. Not found cell and umi barcode in entry __ of the bam file
  2. Sample ID Wittmann_2_Hetero not found in sample sheet
  3. WARNING - The .bam file refers to a chromosome '___--' not present in the annotation (.gtf) file
  4. 2021-08-02 10:42:33,716 - DEBUG - 2997866 reads were skipped because no apropiate cell or umi barcode was found
  5. [E::idx_find_and_load] Could not retrieve index file for 'cellsorted_possorted_genome_bam.bam'

Any idea why that could be? My biggest concerns are 5 (the output of the samtools procedure; I'm not entirely sure what index file is needed here) and 3 (the discrepancy in chromosomes).

I understand the .BAM file cannot be changed at this point (so I assume we cannot do anything about 1) but the discrepancy between the .BAM and .GTF files (both from cell ranger) worries me (errors 3 and 4). I assume error 2 is to do with my metadata file which I do not mind.
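
To see which chromosome names cause the "not present in the annotation" warning, the names in the BAM header can be compared with the first column of the GTF. The sketch below works on plain text lines, assuming the header was exported with `samtools view -H`; all function names are hypothetical. Often the extra names are scaffolds without annotated genes, though that should be confirmed on the actual files.

```python
from typing import Iterable, Set


def bam_header_chromosomes(header_lines: Iterable[str]) -> Set[str]:
    """Collect reference names from the @SQ lines of a SAM/BAM header."""
    names = set()
    for line in header_lines:
        if line.startswith("@SQ"):
            for field in line.rstrip("\n").split("\t")[1:]:
                if field.startswith("SN:"):
                    names.add(field[3:])
    return names


def gtf_chromosomes(gtf_lines: Iterable[str]) -> Set[str]:
    """Collect the sequence names (first column) of a GTF file."""
    names = set()
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        names.add(line.split("\t", 1)[0])
    return names


def missing_from_annotation(header_lines: Iterable[str], gtf_lines: Iterable[str]) -> Set[str]:
    """Return chromosome names present in the BAM header but absent from the GTF."""
    return bam_header_chromosomes(header_lines) - gtf_chromosomes(gtf_lines)
```

Both arguments can be open file objects, for example a saved header file and `genes.gtf`.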

msalaciak commented 3 years ago

Looking through the output, I noticed I also had the error [E::idx_find_and_load] Could not retrieve index file for 'cellsorted_possorted_genome_bam.bam'. I'm not 100% sure it's important, though.

I found this in the samtools manual for the sort function.

Note that if the sorted output file is to be indexed with samtools index, the default coordinate sort must be used. Thus the -n and -t options are incompatible with samtools index.

Since we're sorting by CB, I don't think we need to index it.

As for the "Not found cell and umi barcode in entry __ of the bam file" warning, which I also get: I think it's because we're using the filtered barcodes as input along with the BAM file, right? So I would imagine some being absent.

I'll wait for someone else to chime in though but I think this makes the most sense!
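
One way to check how many reads are affected by the "Not found cell and umi barcode" warning is to count reads that lack the tags. In Cell Ranger BAM files the error-corrected cell barcode and UMI are stored in the CB and UB tags, and reads whose barcode or UMI could not be corrected lack them. The sketch below assumes the warning reflects such reads and works on SAM text lines (for example, the output of `samtools view`); the function name is hypothetical.

```python
from typing import Iterable, Tuple


def count_untagged(sam_lines: Iterable[str]) -> Tuple[int, int]:
    """Return (reads_missing_CB_or_UB, total_reads) for SAM-format text lines."""
    missing = 0
    total = 0
    for line in sam_lines:
        if line.startswith("@"):  # header line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:  # fewer than the 11 mandatory SAM columns
            continue
        total += 1
        # Optional fields follow the 11 mandatory columns and look like CB:Z:<value>.
        tags = {field[:2] for field in fields[11:]}
        if "CB" not in tags or "UB" not in tags:
            missing += 1
    return missing, total
```

Feeding it the first few thousand records of `possorted_genome_bam.bam` gives a rough idea of the fraction of reads that velocyto would skip.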

denvercal1234GitHub commented 3 years ago

@AAA-3 --- Did you end up using run or run10x? And, did you have to samtools sort the bam file separately, then index that cellsorted_possorted bam file to get the index, then run the "run" command?

I ran into the out-of-memory issue described in issue #320. Any thoughts would be really appreciated!

Thank you so much!