hyunhwan-jeong / CB2

CB2 is an R package which provides functions for hit gene identification and quantification of sgRNA (single-guided RNA) abundances for CRISPR (Clustered Regularly Interspaced Short Palindromic Repeats) pooled screen data analysis. Details are in Jeong et al. (2019) <doi:10.1101/gr.245571.118> and Baggerly et al. (2003) <doi:10.1093/bioinformatics/btg173>.
https://cran.r-project.org/web/packages/CB2/index.html

Error arising in run_sgrna_quant #11

Closed klychuk closed 4 years ago

klychuk commented 4 years ago

Hello,

I have previously run CB2 successfully and enjoyed the methods as well as the documentation. When I ran it on a different experiment, I received the error below. I thought it might have to do with my library construction, but I used a Python dictionary to populate the .fasta file, so each value should be unique. The row names in the message don't look like names, and I'm not sure where the issue is coming from.


Error in `.rowNamesDF<-`(x, value = value) : 
  duplicate 'row.names' are not allowed
Calls: run_sgrna_quant ... row.names<- -> row.names<-.data.frame -> .rowNamesDF<-
In addition: Warning message:
non-unique values when setting 'row.names': ‘00:00:00_10’, ‘00:00:00_3’, ‘00:00:00_5’, ‘00:00:00_7’, ‘00:00:00_9’

Thanks,

Karson

hyunhwan-jeong commented 4 years ago

My guess is that some sgRNA names are duplicated in your library. Could you share the FASTA file, if you don't mind? Thank you,

Hyun-Hwan Jeong
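
A quick way to test this guess before sharing the file is to count the header names directly. The following is a minimal sketch in plain Python, not part of CB2; the helper names `fasta_names` and `duplicated` are hypothetical, and the two-record example is taken from the library discussed in this thread.

```python
from collections import Counter


def fasta_names(lines):
    """Return the header names (the text after '>') from FASTA lines."""
    return [line.strip()[1:] for line in lines if line.strip().startswith(">")]


def duplicated(names):
    """Return, in sorted order, every name that occurs more than once."""
    counts = Counter(names)
    return sorted(name for name, n in counts.items() if n > 1)


# Two records from the library in this thread.
example = [
    ">2020-09-10 00:00:00_1",
    "CGCGACCATGGCCTCCTCCG",
    ">2020-09-11 00:00:00_1",
    "TGCCGCAGCTGCGATGGCCG",
]

names = fasta_names(example)
full_header_duplicates = duplicated(names)
print(full_header_duplicates)
```

For a real file, the same functions can be applied to `open(path).read().splitlines()`. An empty result only shows that the complete header lines are unique; it does not rule out collisions that appear once a tool splits the names on whitespace.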

klychuk commented 4 years ago

Yes, I attached the file. GitHub does not support .fasta attachments, so I added .txt to the end of the file name, but it is still formatted as a FASTA file.

40k_lib.fasta.txt

hyunhwan-jeong commented 4 years ago

Thanks for sharing it. I found that the file contains guide names like the following, and these names cause the problem.

 >2020-09-10 00:00:00_1
 CGCGACCATGGCCTCCTCCG
 >2020-09-10 00:00:00_10
 GCTTAATCTTCAGTTCTTCT
 >2020-09-10 00:00:00_2
 CGCCACCTCGGAGGAGGCCA
 >2020-09-10 00:00:00_3
 GTGCCGCGCCACCTCGGAGG
 >2020-09-10 00:00:00_4
 CACCAGGTGCCGCGCCACCT
 >2020-09-10 00:00:00_5
 ATTCGTTCGTTGACTATGTC
 >2020-09-10 00:00:00_6
 TGCTTTAATATTCTCTGTGT
 >2020-09-10 00:00:00_7
 AAATCCCACTGTATTCACAA
 >2020-09-10 00:00:00_8
 ACCAAATAAATAAAGAAGAG
 >2020-09-10 00:00:00_9
 CTTCAGTTCTTCTTGGAGAT
 >2020-09-11 00:00:00_1
 TGCCGCAGCTGCGATGGCCG
 >2020-09-11 00:00:00_10
 TGTTTCAACATCCTTTGTGT
 >2020-09-11 00:00:00_2
 AGCTGCGATGGCCGTGGCCG
 >2020-09-11 00:00:00_3
 CTTCGAAACTTGTCTTTGTC
 >2020-09-11 00:00:00_4
 CTTGTCTTTGTCTGGCCATG
 >2020-09-11 00:00:00_5
 GGAGGCTGTCAAATCCCACA
 >2020-09-11 00:00:00_6
 TGACAGCCTCCCTGACCAGC
 >2020-09-11 00:00:00_7
 TGTTGACCAGCTGGTCAGGG
 >2020-09-11 00:00:00_8
 AAGTAGACTTGTTGACCAGC
 >2020-09-11 00:00:00_9
 GTCAACAAGTCTACTTCTCA
 >2020-09-12 00:00:00_1
 TGCGAGGACAGGCAGGGAGA
 >2020-09-12 00:00:00_10
 TGAGTTCAACATCATGGTGG
 >2020-09-12 00:00:00_2
 GAGGGCTGCGAGGACAGGCA
 >2020-09-12 00:00:00_3
 GGCTGGAGGGCTGCGAGGAC
 >2020-09-12 00:00:00_4
 GGACCAAGCATCTCGCAGGG
 >2020-09-12 00:00:00_5
 TGCGAGATGCTTGGTCCTGT
 >2020-09-12 00:00:00_6
 TGTGGGCATTGAGGCTGTGC
 >2020-09-12 00:00:00_7
 GCTGGACCAGCTGAAGATCA
 >2020-09-12 00:00:00_8
 TCATAGCCTTGATCTTCAGC
 >2020-09-12 00:00:00_9
 AAGATCAAGGCTATGAAGAT
 >2020-09-14 00:00:00_1
 TAGCATGGCAGAAAGAACAA
 >2020-09-14 00:00:00_10
 CTACATAGATGCCCAATTTG
 >2020-09-14 00:00:00_2
 ATTCGTTGTTTAACTACGAT
 >2020-09-14 00:00:00_3
 TGAATGTTTGCCCAATCAGT
 >2020-09-14 00:00:00_4
 AGATCTGCTCACCAACTGAT
 >2020-09-14 00:00:00_5
 GTGAGCAGATCTATCCGACA
 >2020-09-14 00:00:00_6
 TCAGTTGAAATTGACTGTTG
 >2020-09-14 00:00:00_7
 TGACTGTTGTGGAGACAGTA
 >2020-09-14 00:00:00_8
 GTTGTGGAGACAGTAGGGTA
 >2020-09-14 00:00:00_9
 ATCAAATAGACAAAGAAGCC
 >2020-09-01 00:00:00_1
 TCCATCATCGTGGTGAGACA
 >2020-09-01 00:00:00_10
 CCATAGGACAAGGAGTACGT
 >2020-09-01 00:00:00_2
 CACGATGATGGAGCTACAGT
 >2020-09-01 00:00:00_3
 GATGGAGCTACAGTGGGACT
 >2020-09-01 00:00:00_4
 CTTGGAATCCAGATGTGTGA
 >2020-09-01 00:00:00_5
 CCAGATGTGTGAAGGATGGA
 >2020-09-01 00:00:00_6
 TGTGAAGGATGGAGGGTTGA
 >2020-09-01 00:00:00_7
 AGAGACGGCAGGTGCAGTGA
 >2020-09-01 00:00:00_8
 GCAGGTGCAGTGATGGCTGG
 >2020-09-01 00:00:00_9
 AGTGATGGCTGGCGGAGTCA
 >2020-09-03 00:00:00_1
 AAAGGAGGATTCATGTCCAA
 >2020-09-03 00:00:00_10
 TTCATGGGCACCGCTGGCTT
 >2020-09-03 00:00:00_2
 CTGCAGGGCTCCCAGAGACC
 >2020-09-03 00:00:00_3
 CTGCGTCCGTCCTGGTCTCT
 >2020-09-03 00:00:00_4
 TGACATGGCTGCGTCCGTCC
 >2020-09-03 00:00:00_5
 GGACGCAGCCATGTCAGAGC
 >2020-09-03 00:00:00_6
 CTCAGGCACCAGCTCTGACA
 >2020-09-03 00:00:00_7
 CAGAGCTGGTGCCTGAGCCC
 >2020-09-03 00:00:00_8
 TGAGCCCAGGCCTAAGCCAG
 >2020-09-03 00:00:00_9
 GGGCACCGCTGGCTTAGGCC
 >2020-09-04 00:00:00_1
 GACTTTACCCTCATGGTGGC
 >2020-09-04 00:00:00_10
 AGAGAGGATCATGCAAACTG
 >2020-09-04 00:00:00_2
 TCTCTCCTCTCAGGAGAGTC
 >2020-09-04 00:00:00_3
 CTCTCAGGAGAGTCTGGCCT
 >2020-09-04 00:00:00_4
 TGACAAGTGTGGATTTGCCC
 >2020-09-04 00:00:00_5
 GAAGAGGCTATTGACAAGTG
 >2020-09-04 00:00:00_6
 CTTCCTCACTGATCTGTACC
 >2020-09-04 00:00:00_7
 TCACTGATCTGTACCGGGAC
 >2020-09-04 00:00:00_8
 CACCAAGAAGTTTCCGGTCC
 >2020-09-04 00:00:00_9
 CGGAAACTTCTTGGTGCTGA
 >2020-09-05 00:00:00_1
 CGTACTGCTTGTCAATGTCC
 >2020-09-05 00:00:00_10
 GTGTGCAGGTGAGTCAGGCC
 >2020-09-05 00:00:00_2
 CTTCGCCACACTGCCCAACC
 >2020-09-05 00:00:00_3
 GTGCACCTGGTTGGGCAGTG
 >2020-09-05 00:00:00_4
 GACTTGCGGTGCACCTGGTT
 >2020-09-05 00:00:00_5
 TCACCGACTTGCGGTGCACC
 >2020-09-05 00:00:00_6
 CACCGCAAGTCGGTGAAGAA
 >2020-09-05 00:00:00_7
 AGGCTTTGACTTCACACTCA
 >2020-09-05 00:00:00_8
 GACTTCACACTCATGGTGGC
 >2020-09-05 00:00:00_9
 TACCTGTGTGCAGGTGAGTC
 >2020-09-06 00:00:00_1
 AGCGACCGATATAGCTCGCC
 >2020-09-06 00:00:00_10
 TGCTTCAACATCCTGTGCGT
 >2020-09-06 00:00:00_2
 TACCACCTGGCGAGCTATAT
 >2020-09-06 00:00:00_3
 CTCCTTCCAAATTAGGGTGA
 >2020-09-06 00:00:00_4
 TGACAGCTTGCCTGACCAGC
 >2020-09-06 00:00:00_5
 GACTTATTCACCAGCTGGTC
 >2020-09-06 00:00:00_6
 TGACGGACTTATTCACCAGC
 >2020-09-06 00:00:00_7
 GTGAATAAGTCCGTCAGCCA
 >2020-09-06 00:00:00_8
 GAAGCAGAAGCCCTGGCTGA
 >2020-09-06 00:00:00_9
 GGATGTTGAAGCAGAAGCCC
 >2020-09-08 00:00:00_1
 CAACACGACCTTCGAGACTG
 >2020-09-08 00:00:00_10
 TTGTGGATGCCGTGGGCTTT
 >2020-09-08 00:00:00_2
 ACTGGCTTCCTCAGTCTCGA
 >2020-09-08 00:00:00_3
 TGAGGAAGCCAGTCACCATG
 >2020-09-08 00:00:00_4
 CACGCATGCCTCATGGTGAC
 >2020-09-08 00:00:00_5
 ATGAGGCATGCGTGCGCCTG
 >2020-09-08 00:00:00_6
 TCTCCTGGAGGTCATAGGTC
 >2020-09-08 00:00:00_7
 TGAGCTGCACGTTGCTCTCC
 >2020-09-08 00:00:00_8
 GCAGCTCAAGCTGACCATTG
 >2020-09-08 00:00:00_9
 CTGACCATTGTGGATGCCGT
 >2020-09-09 00:00:00_1
 TGCTTGAGCCCGGCATCTCT
 >2020-09-09 00:00:00_10
 TTACCCAAGCCGCTCTGCCC
 >2020-09-09 00:00:00_2
 TGCAGGCGCCTGCTTGAGCC
 >2020-09-09 00:00:00_3
 TCAAGCAGGCGCCTGCATCA
 >2020-09-09 00:00:00_4
 GCCTGCATCACGGAACGAGA
 >2020-09-09 00:00:00_5
 CCACGTAGCCGAAGTCCACC
 >2020-09-09 00:00:00_6
 CCATCCTGGAGCAGATGCGC
 >2020-09-09 00:00:00_7
 GCGCCGGAAGGCCATGAAGC
 >2020-09-09 00:00:00_8
 GAACTCGAAGCCCTGCTTCA
 >2020-09-09 00:00:00_9
 GGGCTTCGAGTTCAACATCA

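The main problem is the space between the date and the time in each name, which leads to the duplication error. Judging from the warning message, only the part after the space (for example 00:00:00_10) ends up as the row name, so guides with different dates collide. The easiest solution is to replace the space between the date and the time with a character such as -.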
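
The replacement can be scripted. Below is a minimal sketch in plain Python, not a CB2 function; `sanitize_header` and `sanitize_fasta` are hypothetical helper names, and `-` is used as the replacement character only because it is the one suggested above.

```python
def sanitize_header(line, replacement="-"):
    """Replace whitespace inside a FASTA header; leave sequence lines untouched."""
    if not line.startswith(">"):
        return line
    name = line[1:].strip()
    # split() with no argument collapses any run of whitespace.
    return ">" + replacement.join(name.split())


def sanitize_fasta(lines, replacement="-"):
    """Apply sanitize_header to every line of a FASTA file."""
    return [sanitize_header(line.rstrip("\n"), replacement) for line in lines]


# Two records from the library in this thread.
example = [
    ">2020-09-10 00:00:00_1",
    "CGCGACCATGGCCTCCTCCG",
    ">2020-09-11 00:00:00_1",
    "TGCCGCAGCTGCGATGGCCG",
]

cleaned = sanitize_fasta(example)
for line in cleaned:
    print(line)
```

After this step the headers contain no whitespace, so the date stays part of each name and the two example guides remain distinct.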

I believe these genes should be named SEPTX (where X is a number), which is presumably the name you expected to see. My guess is that you or the original data provider used Microsoft Excel at some point during data processing, and Excel automatically converted these names to dates. Let me know if this is a problem for you and you need my help.
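
If the original gene symbols are wanted back, the date-like names could be mapped to SEPTX-style names. The sketch below is plain Python and rests on an assumption that the thread does not confirm: that a header such as `2020-09-10 00:00:00_1` originally read `SEPT10_1`, with the day of the month as the gene number and the suffix as the guide index. The function name `restore_sept_name` and the output format are hypothetical.

```python
import re

# Matches headers of the form seen in this thread, e.g. ">2020-09-10 00:00:00_1".
HEADER = re.compile(r"^>(\d{4})-(\d{2})-(\d{2}) 00:00:00_(\d+)$")


def restore_sept_name(line):
    """Turn a September date header back into an assumed SEPT<day>_<guide> name.

    Lines that do not match the pattern, or whose month is not 09, are
    returned unchanged.
    """
    match = HEADER.match(line)
    if match is None or match.group(2) != "09":
        return line
    day = int(match.group(3))  # int() drops the leading zero: "01" -> 1
    guide = match.group(4)
    return f">SEPT{day}_{guide}"


restored = [
    restore_sept_name(line)
    for line in [
        ">2020-09-10 00:00:00_1",
        "CGCGACCATGGCCTCCTCCG",
        ">2020-09-01 00:00:00_10",
        "CCATAGGACAAGGAGTACGT",
    ]
]
for line in restored:
    print(line)
```

The restored names should be checked against the original library annotation before use, since the mapping is only inferred from the dates.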

Thank you,

Hyun-Hwan Jeong

klychuk commented 4 years ago

Thank you so much!