jphe / scTE

MIT License
12 stars 3 forks source link

CB vs CR in input bam #7

Open leeyoyohku opened 3 years ago

leeyoyohku commented 3 years ago

Dear scTE team,

I'm encountering a problem regarding the CR field in input bam files. In my bam files, corrected barcodes are stored in CB, while the uncorrected barcodes are in CR. A line of bam file example is pasted here:

CL100132075L2C003R068_87600 256 chr1 14041 0 100M * 0 0 TTGGTGCCAGTTCCTCCAAGTCGATGGCACCTCCCTCCCTCTCAACCACTTGAGCAAACTCCAAGACATCTTCTACCCCAACACCAGCAATTGTGCCAAG D.FBC>C@DAFFADFEBFFFBEE;;FD0F@DFFFCFFFDFFFF9=DFFF8EE,FFD>CDFFF<FBDFD:EFEEFE<=ED/7FDFFE;F)/BEFDC1EA-C NH:i:9 HI:i:4 nM:i:0 AS:i:98 CR:Z:GTGACTGTAC_TACTAATGGC UR:Z:GGGCCATCAG CB:Z:GTGACTGTAC_TACTAATGGC UB:Z:GGGCCATCAG

As a result, in the output h5ad, there are way more observations than expected, but the upper limit is 10k. So I don't think it captures all lines from bam.

I tried to change my bam with the following code but it reported truncated file error. samtools view -h new/D0_2Aligned.sortedByCoord.out.bam | awk '{$16=$17=""; print $0}'|sed 's/UB:/UR:/g' |sed 's/CB:/CR:/g' |samtools view -h -b > D0_2_CB.bam

Does scTE come with the function to check the CB field instead of CR field? Thanks.

jphe commented 3 years ago

The scTE have moved to: https://github.com/JiekaiLab/scTE

you can set the: -CB CB -UMI UB, which will works for your bam file

leeyoyohku commented 3 years ago

Thank @jphe for your reply. I thought -CB and -UMI only take TRUE/FALSE as input. As I tried the command as you suggested scTE -i D0_1Aligned.sortedByCoord.out.bam -o D0_1 -g hg38 -x ~/yoyo/0_reference/scTEindex.idx.exclusive.idx -p 10 -CB CB -UMI UB --hdf5 True The error occurs: scTE: error: argument -CB: invalid choice: 'CB' (choose from 'True', 'False')

jphe commented 3 years ago

The scTE have moved to: https://github.com/JiekaiLab/scTE, you need download the latest version there

leeyoyohku commented 3 years ago

Thank you! As I installed the new version with the following instructions under a python=3.7.10 environment, $ git clone https://github.com/jphe/scTE.git $ cd scTE $ python setup.py install I still encountered the similar error but with 4 lines of DEBUG this time, seen as below:

scTE -i new/D0_2Aligned.sortedByCoord.out.bam -o D0_1 -g hg38 -x ~/yoyo/0_reference/scTE_human.exclusive.idx -p 10 -CB CB -UMI UB --hdf5 True
DEBUG   : Creating converter from 7 to 5
DEBUG   : Creating converter from 5 to 7
DEBUG   : Creating converter from 7 to 5
DEBUG   : Creating converter from 5 to 7
usage: scTE [-h] [--min_genes INT] [--min_counts INT] [--expect-cells INT]
            [-f [input file format]] [-CB [{True,False}]]
            [-UMI [{True,False}]] [--keeptmp [{True,False}]]
            [--hdf5 [{True,False}]] [-p INT] [-v] -i INPUT [INPUT ...] -x
            ANNOGLB [ANNOGLB ...] -g [genome] -o [OUT]
scTE: error: argument -CB: invalid choice: 'CB' (choose from 'True', 'False')

Much appreciated for your timely help.

jphe commented 3 years ago

The newest scTE have moved to: https://github.com/JiekaiLab/scTE, you need download the latest version under that repository

git clone https://github.com/JiekaiLab/scTE.git, then re-install it

leeyoyohku commented 3 years ago

Ohhhh sorry that I didn't revise the git link cuz I only followed the installation instruction on the new GitHub page. Now it has taken my -CB input, however, it seems not able to recognise my CB even though they present:

$ scTE -i new/D0_2Aligned.sortedByCoord.out.bam -o D0_2 -x ~/yoyo/0_reference/scTE_human.exclusive.idx -p 10 -CB CB -UMI UB --hdf5 True
DEBUG   : Creating converter from 7 to 5
DEBUG   : Creating converter from 5 to 7
DEBUG   : Creating converter from 7 to 5
DEBUG   : Creating converter from 5 to 7
INFO    : Parameter list:
Sample = D0_1
Reference annotation index = /home/yoyo/yoyo/0_reference/scTE_human.exclusive.idx
Minimum number of genes required = 200
Minimum number of counts required = None
Number of threads = 10

INFO    : Loading the genome annotation index... 2021-03-22 03:38:51
INFO    : Loaded '/home/yoyo/yoyo/0_reference/scTE_human.exclusive.idx' binary file with 5589235 items
['1', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '2', '20', '21', '22', '3', '4', '5', '6', '7', '8', '9', 'GL000009.2', 'GL000194.1', 'GL000195.1', 'GL000205.2', 'GL000209.2', 'GL000213.1', 'GL000218.1', 'GL000219.1', 'GL000220.1', 'GL000250.2', 'GL000251.2', 'GL000252.2', 'GL000253.2', 'GL000254.2', 'GL000255.2', 'GL000256.2', 'GL000257.2', 'GL000258.2', 'GL339449.2', 'GL383518.1', 'GL383519.1', 'GL383520.2', 'GL383521.1', 'GL383522.1', 'GL383526.1', 'GL383527.1', 'GL383528.1', 'GL383531.1', 'GL383532.1', 'GL383533.1', 'GL383534.2', 'GL383539.1', 'GL383540.1', 'GL383541.1', 'GL383542.1', 'GL383545.1', 'GL383546.1', 'GL383550.2', 'GL383551.1', 'GL383552.1', 'GL383553.2', 'GL383554.1', 'GL383555.2', 'GL383556.1', 'GL383557.1', 'GL383563.3', 'GL383564.2', 'GL383565.1', 'GL383566.1', 'GL383567.1', 'GL383569.1', 'GL383570.1', 'GL383571.1', 'GL383573.1', 'GL383574.1', 'GL383575.2', 'GL383576.1', 'GL383577.2', 'GL383579.2', 'GL383580.2', 'GL383581.2', 'GL383582.2', 'GL383583.2', 'GL582966.2', 'GL877875.1', 'GL877876.1', 'GL949742.1', 'GL949746.1', 'GL949747.2', 'GL949748.2', 'GL949749.2', 'GL949750.2', 'GL949751.2', 'GL949752.1', 'GL949753.2', 'JH159136.1', 'JH159137.1', 'JH159146.1', 'JH159147.1', 'JH159148.1', 'JH636055.2', 'KB021644.2', 'KB663609.1', 'KI270442.1', 'KI270711.1', 'KI270713.1', 'KI270721.1', 'KI270726.1', 'KI270727.1', 'KI270728.1', 'KI270731.1', 'KI270733.1', 'KI270734.1', 'KI270744.1', 'KI270750.1', 'KI270758.1', 'KI270759.1', 'KI270760.1', 'KI270762.1', 'KI270763.1', 'KI270765.1', 'KI270766.1', 'KI270767.1', 'KI270768.1', 'KI270769.1', 'KI270770.1', 'KI270772.1', 'KI270774.1', 'KI270776.1', 'KI270777.1', 'KI270779.1', 'KI270780.1', 'KI270781.1', 'KI270782.1', 'KI270783.1', 'KI270784.1', 'KI270785.1', 'KI270786.1', 'KI270788.1', 'KI270789.1', 'KI270790.1', 'KI270791.1', 'KI270792.1', 'KI270793.1', 'KI270794.1', 'KI270795.1', 'KI270796.1', 'KI270797.1', 'KI270798.1', 'KI270799.1', 'KI270800.1', 'KI270801.1', 'KI270802.1', 'KI270803.1', 'KI270804.1', 'KI270805.1', 'KI270806.1', 'KI270807.1', 'KI270808.1', 'KI270809.1', 'KI270810.1', 'KI270811.1', 'KI270812.1', 'KI270813.1', 'KI270814.1', 'KI270815.1', 'KI270816.1', 'KI270817.1', 'KI270818.1', 'KI270819.1', 'KI270821.1', 'KI270822.1', 'KI270823.1', 'KI270824.1', 'KI270825.1', 'KI270829.1', 'KI270830.1', 'KI270831.1', 'KI270832.1', 'KI270833.1', 'KI270834.1', 'KI270835.1', 'KI270837.1', 'KI270838.1', 'KI270840.1', 'KI270842.1', 'KI270844.1', 'KI270845.1', 'KI270846.1', 'KI270847.1', 'KI270848.1', 'KI270849.1', 'KI270850.1', 'KI270851.1', 'KI270852.1', 'KI270853.1', 'KI270854.1', 'KI270855.1', 'KI270856.1', 'KI270857.1', 'KI270858.1', 'KI270859.1', 'KI270860.1', 'KI270861.1', 'KI270862.1', 'KI270863.1', 'KI270865.1', 'KI270866.1', 'KI270867.1', 'KI270868.1', 'KI270869.1', 'KI270870.1', 'KI270871.1', 'KI270872.1', 'KI270873.1', 'KI270874.1', 'KI270875.1', 'KI270876.1', 'KI270877.1', 'KI270878.1', 'KI270879.1', 'KI270880.1', 'KI270881.1', 'KI270882.1', 'KI270883.1', 'KI270884.1', 'KI270885.1', 'KI270886.1', 'KI270887.1', 'KI270888.1', 'KI270889.1', 'KI270890.1', 'KI270891.1', 'KI270892.1', 'KI270893.1', 'KI270894.1', 'KI270895.1', 'KI270896.1', 'KI270897.1', 'KI270898.1', 'KI270899.1', 'KI270901.1', 'KI270903.1', 'KI270904.1', 'KI270905.1', 'KI270906.1', 'KI270907.1', 'KI270908.1', 'KI270909.1', 'KI270910.1', 'KI270911.1', 'KI270912.1', 'KI270913.1', 'KI270914.1', 'KI270915.1', 'KI270916.1', 'KI270917.1', 'KI270918.1', 'KI270919.1', 'KI270920.1', 'KI270921.1', 'KI270924.1', 'KI270925.1', 'KI270926.1', 'KI270927.1', 'KI270928.1', 'KI270930.1', 'KI270931.1', 'KI270932.1', 'KI270933.1', 'KI270934.1', 'KI270935.1', 'KI270936.1', 'KI270937.1', 'KI270938.1', 'KN196472.1', 'KN196473.1', 'KN196474.1', 'KN196475.1', 'KN196476.1', 'KN196477.1', 'KN196478.1', 'KN196479.1', 'KN196480.1', 'KN196481.1', 'KN196484.1', 'KN196485.1', 'KN196486.1', 'KN196487.1', 'KN538360.1', 'KN538361.1', 'KN538362.1', 'KN538363.1', 'KN538364.1', 'KN538368.1', 'KN538369.1', 'KN538370.1', 'KN538371.1', 'KN538372.1', 'KN538373.1', 'KQ031383.1', 'KQ031384.1', 'KQ031385.1', 'KQ031387.1', 'KQ031388.1', 'KQ031389.1', 'KQ090014.1', 'KQ090015.1', 'KQ090016.1', 'KQ090018.1', 'KQ090021.1', 'KQ090022.1', 'KQ090026.1', 'KQ090027.1', 'KQ090028.1', 'KQ458382.1', 'KQ458383.1', 'KQ458384.1', 'KQ458385.1', 'KQ458386.1', 'KQ458387.1', 'KQ458388.1', 'KQ759759.1', 'KQ759760.1', 'KQ759761.1', 'KQ759762.1', 'KQ983255.1', 'KQ983256.1', 'KQ983257.1', 'KQ983258.1', 'KV575243.1', 'KV575244.1', 'KV575245.1', 'KV575246.1', 'KV575247.1', 'KV575248.1', 'KV575249.1', 'KV575250.1', 'KV575251.1', 'KV575252.1', 'KV575253.1', 'KV575254.1', 'KV575255.1', 'KV575256.1', 'KV575257.1', 'KV575258.1', 'KV575259.1', 'KV575260.1', 'KV766192.1', 'KV766193.1', 'KV766194.1', 'KV766195.1', 'KV766196.1', 'KV766198.1', 'KV880763.1', 'KV880764.1', 'KV880765.1', 'KV880766.1', 'KV880768.1', 'KZ208904.1', 'KZ208905.1', 'KZ208906.1', 'KZ208907.1', 'KZ208908.1', 'KZ208909.1', 'KZ208911.1', 'KZ208912.1', 'KZ208913.1', 'KZ208914.1', 'KZ208915.1', 'KZ208916.1', 'KZ208917.1', 'KZ208919.1', 'KZ208920.1', 'KZ208921.1', 'KZ208922.1', 'KZ208924.1', 'KZ559102.1', 'KZ559103.1', 'KZ559104.1', 'KZ559105.1', 'KZ559106.1', 'KZ559107.1', 'KZ559108.1', 'KZ559109.1', 'KZ559110.1', 'KZ559111.1', 'KZ559112.1', 'KZ559113.1', 'KZ559114.1', 'KZ559115.1', 'KZ559116.1', 'M', 'ML143341.1', 'ML143342.1', 'ML143343.1', 'ML143344.1', 'ML143345.1', 'ML143347.1', 'ML143349.1', 'ML143350.1', 'ML143352.1', 'ML143354.1', 'ML143355.1', 'ML143357.1', 'ML143358.1', 'ML143359.1', 'ML143360.1', 'ML143361.1', 'ML143362.1', 'ML143366.1', 'ML143367.1', 'ML143369.1', 'ML143370.1', 'ML143371.1', 'ML143372.1', 'ML143373.1', 'ML143374.1', 'ML143375.1', 'ML143376.1', 'ML143377.1', 'ML143378.1', 'ML143380.1', 'ML143381.1', 'X', 'Y']
INFO    : Finished loading the genome annotation index... 2021-03-22 03:39:23

INFO    : Processing BAM/SAM files ...2021-03-22 03:39:23
ERROR   : The input file new/D0_2Aligned.sortedByCoord.out.bam has no cell barcodes information, plese make sure the aligner have add the cell barcode key, or set CB to False

Below is how my bam file looks like:

samtools view new/D0_2Aligned.sortedByCoord.out.bam |head -5
CL100132075L2C003R031_474121    272 chr1    11926   0   100M    *   0   0   GTCTCTTAGCCCAGACTTCCCGTGTCCTTTCCACCGGGCCTTTGAGAGGTCACAGGGTCTTGATGCTGTGGTCTTCATCTGCAGGTGTCTGACTTCCAGC    FFFCEEFDFFFFFEFF;DGFFF8GCFFABBFFFF@FFFFA:DGFGFFFFEFFEFGFFDG@FGFFFCFFFGFEFFDFFFGFFGFFFCFD=DFFFFEFFFFF    NH:i:7  HI:i:2  nM:i:0  AS:i:98 CR:Z:CATACACTTG_ATTATTGTGC  UR:Z:GCCGTATGTT CB:Z:CATACACTTG_ATTATTGTGC  UB:Z:GCCGTATGTT
CL100132075L2C007R010_384571    272 chr1    12022   0   100M    *   0   0   CAGCAACTGCTGGCCTGTGCCAGGGTGCAAGCTGAGCACTGGAGTGGAGTTTTCCTGTGGAGAGGAGCCATGCCTAGAGTGGGATGGGCCGTTGTTCATC    FFFFDBFFFFFFF?FCF;FEFFFFF5FFFFFFEFFFFEF>FFFF?FFFFF8FEDEFFFFFFFFFFFFBFFFFFFFFFFFFFFF;FEFFFFFFFFF>FFFF    NH:i:10 HI:i:3  nM:i:1  AS:i:96 CR:Z:CATACACTTG_ATTATTGTGC  UR:Z:GCCGNATGTT
CL100132075L2C007R055_495884    272 chr1    12025   0   100M    *   0   0   CAACTGCTGGCCTGTGCCAGGGTGCAAGCTGAGCACTGGAGTGGAGTTTTCCTGTGGAGAGGAGCCATGCCTAGAGTGGGATGGGCCATTGTTCATCTTC    BEEAEEFGEEFCFEFEFCEEEEEEEEFEGEEEEEF<GEEBEEEEEEFAEFGGEEEEEBEEEGEFGCEEEGFEBEEEEFFEEEEFFGDGFEEEGEEBEFF<    NH:i:10 HI:i:7  nM:i:0  AS:i:98 CR:Z:TGGTTAGGCA_TGGATTAGAC  UR:Z:GCATCAAGAC CB:Z:TGGTTAGGCA_TGGATTAGAC  UB:Z:GCATCAAGAC
CL100132075L2C001R049_91491 272 chr1    12026   0   100M    *   0   0   AACTGCTGGCCTGTGCCAGGGTGCAAGCTGAGCACTGGAGTGGAGTTTTCCTGTGGAGAGGAGCCATGCCTAGAGTGGGATGGGCCATTGTTCATCTTCT    359GA@F8;DDF?FEF<ED7ECEEEEAECB@E=6;?EE@@>BEEEEEEE.@EEEBEEEADBEE?CGEEGEBEDEEEEGGEEEDFDDEEGEEGDBFFFFFF    NH:i:10 HI:i:7  nM:i:0  AS:i:98 CR:Z:TGGTTAGGCA_TGGATTAGAC  UR:Z:GCATCAAGAC CB:Z:TGGTTAGGCA_TGGATTAGAC  UB:Z:GCATCAAGAC
CL100132075L2C002R063_343056    272 chr1    12039   0   100M    *   0   0   TGCCAGGGTGCAAGCTGAGCACTGGAGTGGAGTTTTCCTGTGGAGAGGAGCCATGCCTAGAGTGGGATGGGCCATTGTTCATCTTCTGGCCCCTGTTGTC    GFFDFGFFFGFGF:F2FF<EFDFFCEG6FGAFF>EAFDEGFFGF>FFEFGFFGGFGAEFGFFFEFFFGEGGEGFFFE?BFGEFFFGGEGFFGGFGBFGGF    NH:i:10 HI:i:7  nM:i:0  AS:i:98 CR:Z:TGGTTAGGCA_TGGATTAGAC  UR:Z:GCATCAAGTC CB:Z:TGGTTAGGCA_TGGATTAGAC  UB:Z:GCATCAAGTC

I wonder if it's because of the in my CB field and tried 900 lines with removed, but the error persists. Also in the previous version it could recognise my _ linked uncorrected barcode in CR field. I'm not sure why it happens.

Much appreciated for your timely help.

jphe commented 3 years ago

How do you generate the bam file? the file seems corrupted, the 2nd line has no cell barcode (CB:Z) information, but only CR:Z

leeyoyohku commented 3 years ago

I generated the bam using STARsolo with 1MM for error correction, seen the exact line of code below. Regarding some empty CB fields, I think they are reads that cannot be matched to any barcode even allowing 1 mismatch. The bam file went smoothly with velocyto.

STAR --genomeDir /home/yoyo/yoyo/0_reference/4_gencode.v36.starindex --readFilesIn 0_raw/CL100132075_L02_read_2.fq.gz 0_raw/CL100132075_L02_read_1.fq.gz --soloCBwhitelist barcode_list.txt barcode_list.txt --runThreadN 8 --soloType CB_UMI_Complex --soloCBposition 0_0_0_9 0_16_0_25 --soloUMIposition 0_30_1_-1 --soloCBmatchWLtype 1MM --outSAMattributes NH HI nM AS CR UR CB UB --outSAMtype BAM SortedByCoordinate --readFilesCommand gunzip -c --outFileNamePrefix new/D0_2

jphe commented 3 years ago

Maybe you can do pre-filtering to remove the empty CB reads, and then run scTE. We'll check this in detail later and to see if can ignore those reads in scTE.

leeyoyohku commented 3 years ago

Thank @jphe for your generous help. I removed lines without CB from bam using samtools view new/D0_2Aligned.sortedByCoord.out.bam -h | awk '/^@/ || /CB:/' |samtools view -h -b >new/D0_2_CB_clean.bam ran the scTE command again scTE -i new/D0_2_CB_clean.bam -o new/D0_2 -x ~/yoyo/0_reference/scTE_human.exclusive.idx -p 10 -CB CB -UMI UB --hdf5 True and ended up with the following index out of range error. Even when I tried the command without -CB and -UMI flags, there's no h5 output file.

INFO    : Loading the genome annotation index... 2021-03-22 07:18:05
DEBUG   : Creating converter from 7 to 5
DEBUG   : Creating converter from 5 to 7
DEBUG   : Creating converter from 7 to 5
DEBUG   : Creating converter from 5 to 7
INFO    : Parameter list:
Sample = new/D0_2
Reference annotation index = /home/yoyo/yoyo/0_reference/scTE_human.exclusive.idx
Minimum number of genes required = 200
Minimum number of counts required = None
Number of threads = 10

INFO    : Loading the genome annotation index... 2021-03-22 07:18:14
INFO    : Loaded '/home/yoyo/yoyo/0_reference/scTE_human.exclusive.idx' binary file with 5589235 items
INFO    : Finished loading the genome annotation index... 2021-03-22 07:18:57

INFO    : Processing BAM/SAM files ...2021-03-22 07:18:57
INFO    : Input SAM/BAM file appears to be valid
INFO    : Done BAM/SAM files processing ...2021-03-22 07:49:29

INFO    : Splitting ...2021-03-22 08:31:40
INFO    : Executing multiple thread path with 10 threads
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/home/yoyo/miniconda2/envs/scTE/lib/python3.7/multiprocessing/pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "/home/yoyo/miniconda2/envs/scTE/lib/python3.7/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "/home/yoyo/miniconda2/envs/scTE/lib/python3.7/site-packages/scTE-1.0-py3.7.egg/scTE/base.py", line 366, in splitChr
    CRs[t[3]] += 1
IndexError: list index out of range
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/yoyo/miniconda2/envs/scTE/bin/scTE", line 4, in <module>
    __import__('pkg_resources').run_script('scTE==1.0', 'scTE')
  File "/home/yoyo/miniconda2/envs/scTE/lib/python3.7/site-packages/pkg_resources/__init__.py", line 651, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/home/yoyo/miniconda2/envs/scTE/lib/python3.7/site-packages/pkg_resources/__init__.py", line 1448, in run_script
    exec(code, namespace, namespace)
  File "/home/yoyo/miniconda2/envs/scTE/lib/python3.7/site-packages/scTE-1.0-py3.7.egg/EGG-INFO/scripts/scTE", line 169, in <module>
    main()
  File "/home/yoyo/miniconda2/envs/scTE/lib/python3.7/site-packages/scTE-1.0-py3.7.egg/EGG-INFO/scripts/scTE", line 134, in main
    pool.map(partial_work, chr_list)
  File "/home/yoyo/miniconda2/envs/scTE/lib/python3.7/multiprocessing/pool.py", line 268, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/home/yoyo/miniconda2/envs/scTE/lib/python3.7/multiprocessing/pool.py", line 657, in get
    raise self._value
IndexError: list index out of range
leeyoyohku commented 3 years ago

Update: I tried the same command without -CB and -UMI flags using the previous version (which worked before even though the barcode didn't relate to corrected ones), but it showed similar result to the updated version, I wonder if it's the index file problem (I reindexed during reinstalling the new version). Let me reindex and try if it works. Thanks

leeyoyohku commented 3 years ago

Sadly, reindexing doesn't work.