ucagenomix / sicelore-2.1


SelectValidCellBarcode discards lots of cell barcodes #3

Closed yuntianf closed 1 year ago

yuntianf commented 1 year ago

Hi, I found that many valid barcodes are filtered out by SelectValidCellBarcode. Here is my command for this step:

SelectValidCellBarcode -I ./data/readscan/BarcodesAssigned.tsv -O ./data/out/barcodes.csv -MINUMI 1 -ED0ED1RATIO 1

I think this should preserve barcodes that have at least one UMI and more matches at edit distance 0 than at edit distance 1, yet many barcodes satisfying this threshold are filtered out:

INFO    2023-01-20 17:07:31     SelectValidCellBarcode  TCAATCTTCGGATGTT barcode removed                [TCAATCTTCGGATGTT       355     166     104     85]
INFO    2023-01-20 17:07:31     SelectValidCellBarcode  CATTATCGTAGCGATG barcode removed                [CATTATCGTAGCGATG       347     182     86      79]
INFO    2023-01-20 17:07:31     SelectValidCellBarcode  TGGCTGGTCGAATGCT barcode removed                [TGGCTGGTCGAATGCT       340     166     99      75]
INFO    2023-01-20 17:07:31     SelectValidCellBarcode  GCGCAACTCCGTACAA barcode removed                [GCGCAACTCCGTACAA       338     178     86      74]
INFO    2023-01-20 17:07:31     SelectValidCellBarcode  TGAGCATCATCCGTGG barcode removed                [TGAGCATCATCCGTGG       330     165     94      71]
INFO    2023-01-20 17:07:31     SelectValidCellBarcode  GCAGTTACATCCCACT barcode removed                [GCAGTTACATCCCACT       324     160     89      75]
INFO    2023-01-20 17:07:31     SelectValidCellBarcode  CCTTACGTCTCGCATC barcode removed                [CCTTACGTCTCGCATC       320     150     74      96]
INFO    2023-01-20 17:07:31     SelectValidCellBarcode  GGATTACAGCTGATAA barcode removed                [GGATTACAGCTGATAA       320     154     92      74]
INFO    2023-01-20 17:07:31     SelectValidCellBarcode  GGCAATTGTAGGCTGA barcode removed                [GGCAATTGTAGGCTGA       311     145     88      78]
INFO    2023-01-20 17:07:31     SelectValidCellBarcode  TTCCCAGCATGGTTGT barcode removed                [TTCCCAGCATGGTTGT       309     159     87      63]
INFO    2023-01-20 17:07:31     SelectValidCellBarcode  ACGTCAATCAGCAACT barcode removed                [ACGTCAATCAGCAACT       307     152     96      59]
INFO    2023-01-20 17:07:31     SelectValidCellBarcode  CGCTATCAGTTGTCGT barcode removed                [CGCTATCAGTTGTCGT       305     157     78      70]
INFO    2023-01-20 17:07:31     SelectValidCellBarcode  CGTCACTCAATGTAAG barcode removed                [CGTCACTCAATGTAAG       305     165     70      70]
INFO    2023-01-20 17:07:31     SelectValidCellBarcode  AGCTCTCAGGTGCTAG barcode removed                [AGCTCTCAGGTGCTAG       302     151     86      65]
INFO    2023-01-20 17:07:31     SelectValidCellBarcode  GCAGCCAAGTCCAGGA barcode removed                [GCAGCCAAGTCCAGGA       298     166     73      59]
INFO    2023-01-20 17:07:31     SelectValidCellBarcode  CGGAGTCTCATGTAGC barcode removed                [CGGAGTCTCATGTAGC       295     158     82      55]
INFO    2023-01-20 17:07:31     SelectValidCellBarcode  GAAATGATCCCTCTTT barcode removed                [GAAATGATCCCTCTTT       292     166     63      63]
INFO    2023-01-20 17:07:31     SelectValidCellBarcode  CACCTTGTCTCGCATC barcode removed                [CACCTTGTCTCGCATC       291     148     78      65]
INFO    2023-01-20 17:07:31     SelectValidCellBarcode  CCTACACCAAAGCGGT barcode removed                [CCTACACCAAAGCGGT       279     142     91      46]
INFO    2023-01-20 17:07:31     SelectValidCellBarcode  GTCGGGTCAGCTGGCT barcode removed                [GTCGGGTCAGCTGGCT       275     125     83      67]
INFO    2023-01-20 17:07:31     SelectValidCellBarcode  CCTTTCTAGCTCAACT barcode removed                [CCTTTCTAGCTCAACT       226     114     61      51]
INFO    2023-01-20 17:07:31     SelectValidCellBarcode  GGACAGATCGAGAACG barcode removed                [GGACAGATCGAGAACG       220     106     58      56]
INFO    2023-01-20 17:07:31     SelectValidCellBarcode  TACCTTAGTCGAGTTT barcode removed                [TACCTTAGTCGAGTTT       214     124     46      44]
INFO    2023-01-20 17:07:31     SelectValidCellBarcode  CAGCATACATGACATC barcode removed                [CAGCATACATGACATC       214     102     61      51]
INFO    2023-01-20 17:07:31     SelectValidCellBarcode  GGATTACCATCTATGG barcode removed                [GGATTACCATCTATGG       214     109     47      58]
INFO    2023-01-20 17:07:31     SelectValidCellBarcode  AGATCTGTCTACTTAC barcode removed                [AGATCTGTCTACTTAC       208     111     53      44]
INFO    2023-01-20 17:07:31     SelectValidCellBarcode  AAGCCGCGTTAAGACA barcode removed                [AAGCCGCGTTAAGACA       206     100     59      47]
..............
INFO    2023-01-20 17:07:31     SelectValidCellBarcode  Total cell barcodes             [356]
INFO    2023-01-20 17:07:31     SelectValidCellBarcode  Valid cell barcodes             [12]
[Fri Jan 20 17:07:31 EST 2023] org.ipmc.sicelore.programs.SelectValidCellBarcode done. Elapsed time: 0.00 minutes.
Runtime.totalMemory()=324403200

However, when I check the output barcodes.csv, I find that the preserved barcodes have far fewer reads and larger edit distances:

Barcode             n Reads with ED<=2 match    ED=0    ED=1    ED=2
ATTGGTGGTTTGACAC    20                          10      10
ATTCTACGTCTGGAGA    6                           3       3
CATATTCCATGTCCTC    4                           2       2
GTCGGGTCAGTGACAG    1                                   1
TCTTTCCCAACTGCTA    1                                   1
ACTTACTCAGCTTAAC    1                                   1
CGGTTAAAGACATAAC    1                                   1
GAACGGATCTTTAGTC    1                                   1
TCTTTCCAGCTCCTTC    1                                   1
TTATGCTAGCGATATA    1                                   1
CGGAGCTTCTGGTGTA    1                                   1
GTATTCTAGGACAGAA    1                                   1
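
The expectation described above can be checked directly against the "barcode removed" log lines. The following is only a minimal sketch: it assumes the four bracketed numbers are the total read count followed by the ED=0, ED=1 and ED=2 counts, and it encodes the reading of -MINUMI and -ED0ED1RATIO given in this report, not the tool's actual implementation. The helper names `parse_removed` and `expected_to_pass` are hypothetical.

```python
import re

# Bracketed counts at the end of a "barcode removed" log line.
# Assumed column order: [BARCODE  total  ED=0  ED=1  ED=2]
LOG_ROW = re.compile(r"\[([ACGT]+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\]")


def parse_removed(line):
    """Return (barcode, total, ed0, ed1, ed2), or None if the line has no counts."""
    match = LOG_ROW.search(line)
    if match is None:
        return None
    barcode = match.group(1)
    total, ed0, ed1, ed2 = (int(value) for value in match.groups()[1:])
    return barcode, total, ed0, ed1, ed2


def expected_to_pass(total, ed0, ed1, min_umi=1, ratio=1.0):
    """Filter as understood in this report (an assumption, not sicelore's code):
    at least min_umi reads, and more ED=0 matches than ratio * ED=1 matches."""
    return total >= min_umi and ed0 > ratio * ed1


# Two lines copied from the log above.
log_lines = [
    "INFO    2023-01-20 17:07:31     SelectValidCellBarcode  TCAATCTTCGGATGTT barcode removed                [TCAATCTTCGGATGTT       355     166     104     85]",
    "INFO    2023-01-20 17:07:31     SelectValidCellBarcode  CCTTACGTCTCGCATC barcode removed                [CCTTACGTCTCGCATC       320     150     74      96]",
]

for line in log_lines:
    row = parse_removed(line)
    if row is not None:
        barcode, total, ed0, ed1, _ed2 = row
        print(barcode, expected_to_pass(total, ed0, ed1))
```

Under these assumptions both sample barcodes satisfy the threshold (166 > 104 and 150 > 74), which is why their removal looks wrong.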

I didn't find any other parameters for SelectValidCellBarcode, so I'm not sure whether this is a bug or a problem with my command. Here is the data I used, in case it helps: https://drive.google.com/drive/folders/11P4r24DxEHM1tCm66vzPn0KXCoXEi0N6?usp=sharing

I would appreciate any thoughts or comments on this. Thanks!

ucagenomix commented 1 year ago

Hi, you are right, all of those cell barcodes should be kept. The issue comes from the fact that you used ed=2 in the previous step, and that step should output "0" instead of a blank when there are no events. We should provide a fix quickly; thanks for pointing this out. In the meantime, for the next steps you can use a custom barcodes.csv file (one cell barcode per line) that does not filter out any cell barcodes.

Best,
Kevin
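
The interim workaround (an unfiltered barcodes.csv with one cell barcode per line) could be produced along these lines. This is a sketch under stated assumptions: it supposes the barcode table is tab-separated, has a header row, and carries the barcode in its first column, none of which is confirmed in this thread, so check the real file first. The function names are hypothetical, and `count_or_zero` only illustrates the "blank instead of 0" cause described above; it is not the fix that was committed.

```python
import csv
import io


def all_barcodes(tsv_text, has_header=True):
    """Return every barcode (first column) from tab-separated text, unfiltered.

    Assumes a header row and the barcode in column 1 (unverified assumptions).
    """
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    rows = [row for row in reader if row and row[0].strip()]
    if has_header:
        rows = rows[1:]
    return [row[0].strip() for row in rows]


def count_or_zero(field):
    """Read a count field, treating a blank as 0 instead of failing or shifting."""
    field = field.strip()
    return int(field) if field else 0


def write_barcodes(barcodes, path):
    """Write one cell barcode per line."""
    with open(path, "w") as handle:
        handle.write("\n".join(barcodes) + "\n")


# Small sample with blank count fields; the header names here are hypothetical.
sample = (
    "Barcode\tn\tED=0\tED=1\tED=2\n"
    "ATTGGTGGTTTGACAC\t20\t10\t10\t\n"
    "GTCGGGTCAGTGACAG\t1\t\t1\t\n"
)

print("\n".join(all_barcodes(sample)))
```

With a real file, the text would be read from disk and the result passed to `write_barcodes(all_barcodes(text), "barcodes.csv")` before running the next steps.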

ucagenomix commented 1 year ago

Please re-clone the repository; it should be fixed now.

yuntianf commented 1 year ago

It works this time, thanks!