smzt commented 1 year ago

Hi, There is an error in the code of Assemblefastq.R This line
if(! all(cutf(as.vector(fq1.f@id), d = " ", f = 1) == cutf(as.vector(fq2.f@id), d = "", f = 1))) stop("Read ID mismatch")

should be changed to this one if(! all(cutf(as.vector(fq1.f@id), d = " ", f = 1) == cutf(as.vector(fq2.f@id), d = " ", f = 1))) stop("Read ID mismatch") otherwise, read pairs will always mismatch.

Best regards,

S

petervangalen commented 1 year ago

Hi smzt, This is not an error, the code works as intended for us. If we want to find out why, it would be informative to compare our outputs of head(as.vector(fq1.f@id)) and head(as.vector(fq2.f@id)) Peter

smzt commented 1 year ago

Hi Peter, Thanks very much for you quick reply.

I was getting the error message "Read ID mismatch" with one of the samples from the paper where MAESTER was first published from GEO. I might be wrong but I think the is an error in the code. I compared the outputs of cutf(as.vector(fq1.f@id), d = " ", f = 1) against cutf(as.vector(fq2.f@id), d = "_", f = 1) and they are different so the program stops. The reason behind this error is because there is an underscore in the second fastq and a space for the first read so the names are always different.

Regards.

Sheila

petervangalen commented 1 year ago

The line is there as a safety measure to make sure the right fastq files are used. The read IDs from R1 and R2 should be the same, but part of a string with a different delimiter, so the IDs are extracted from that string and compared. If they do not match, the loop should stop with an error. This is intended. Is it possible for you to share the output of head(as.vector(fq1.f@id)) and head(as.vector(fq2.f@id))?

smzt commented 1 year ago

Hi, Here is the output of those commands.

head(as.vector(fq1.f@id)) [1] "SRR15598774.1 1 length=28" "SRR15598774.2 2 length=28" "SRR15598774.3 3 length=28" "SRR15598774.4 4 length=28" "SRR15598774.5 5 length=28" "SRR15598774.6 6 length=28" head(as.vector(fq2.f@id)) [1] "SRR15598774.1 1 length=264_GCCTGTTTCCGAACGC_TCATTCTGGCTT" "SRR15598774.2 2 length=264_AGCGATTGTCTTGAGT_CACAGTGCTATT" [3] "SRR15598774.3 3 length=264_CTGTCGTAGGATTTCC_CCGTCTGGATCG" "SRR15598774.4 4 length=264_CGCATAATCACTGAAC_CCCAAACAATTA" [5] "SRR15598774.5 5 length=264_GGTTCTCTCTCTCTAA_ACTAACCTCTCT" "SRR15598774.6 6 length=264_CTGCCATTCCAACCAA_TCTTTGCTGTGC"

I also add the output from the cutf function.

head(cutf(as.vector(fq1.f@id), d = " ", f = 1)) [1] "SRR15598774.1" "SRR15598774.2" "SRR15598774.3" "SRR15598774.4" "SRR15598774.5" "SRR15598774.6"

head(cutf(as.vector(fq2.f@id), d = "_", f = 1)) [1] "SRR15598774.1 1 length=264" "SRR15598774.2 2 length=264" "SRR15598774.3 3 length=264" "SRR15598774.4 4 length=264" "SRR15598774.5 5 length=264" [6] "SRR15598774.6 6 length=264"

Read names are always different because of the underscore. It might be a problem related to read names. Here are a few lines of my fastq files downloaded from GEO. R1 reads: @SRR15598774.1 1 length=28 GCCTGTTTCCGAACGCTCATTCTGGCTT +SRR15598774.1 1 length=28 FFFFFFFFFFFFFFFFFFFFFFFFFFFF @SRR15598774.2 2 length=28 AGCGATTGTCTTGAGTCACAGTGCTATT +SRR15598774.2 2 length=28 FFFF:::FFF:FFFFFF::FFFFFFFFF

R2 reads: @SRR15598774.1 1 length=264 NAGCCCCAAACCCACTCCACCTTACTACCAGACAACCTTAGCCAAACCATTTACCCAAATAAAGTATAGGCGACCTCGGAGCAGAACCCAACCTCCGAGCAGTACATGCTAAGACTTCACCAGTCAAAGCGAACTACTATACTCAATTGATCCAATAACTTGACCAACGGAACAAGTTACCCTAGGGCTAACAGCGCAATCCTATTCTAGAGTCCATATCAACAATAGGGGTAACGACCTCGATGTTGGAGCAGGACATCCCGA +SRR15598774.1 1 length=264

FFFFFFFFFFFFFFFFF,FFFF:FF,FFFFFFFFFFFFFFFFFF:FFFFFFFFFFFF:FFFFFF:FFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFF,FFF,FFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFF:FFFFFFFF::,FFFFFFFFF:FFF:F:F,,FFFF,F,F,,,:F,:FFFF::F:,,,,F:,F:F:FFF::FFFF,FF:F:F,:FFF:,

@SRR15598774.2 2 length=264 NTGCCCTCCTTTTACCCCTACCATGAGCCCTACAAACAACTAACCTGCCACTAATAGTTATGTCATCCCTCTTATTAATCATCATCCTAGCCCTAAGTCTGGCCTATGAGTGACTACAAAAAGGATTAGACTGAACCGAATGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAATATCAATGTGAAGAAAGAAAATCGCTGATAGGGGAGAGGGTCGTGGTGGGAAAGATATTTTGGGGGGGGGGGGGGGGGGGGGGGGGGGG

I checked that all names were identical just changing the underscore for an space here cutf(as.vector(fq2.f@id), d = " ", f = 1) and then all worked.

Thanks in advance.

Best regards,

Sheila

petervangalen commented 1 year ago

The issue is that your fastq sequence identifiers are different from the fastq sequence identifiers I'm used to. Hopefully this won't cause any issues downstream. I could look into the specific fastq if you let me know how you downloaded them.

smzt commented 1 year ago

Hello, The CB and UMI are not properly included in the read names so the pipeline fails.

The data I'm analyzing was downloaded from GEO. Study ID GSE182685. Samples: K562-BT142_10x_scRNAseq (SRR15598773) and K562-BT142_10x_MAESTER (SRR15598774).

Could you provide an example of the read naming convention that the fastq files should have?

Thanks in advance.

Best regards,

Sheila

smzt commented 1 year ago

Hello, After running Assemble_fastq.R, I changed fastq read names from this format @SRR15598774.1 1 length=264_GCCTGTTTCCGAACGC_TCATTCTGGCTT to this format @SRR15598774_2_AGCGATTGTCTTGAGT_CACAGTGCTATT.

Now I can see the cell barcodes and UMI in the BAM files in the same flags I see in the BAM file from scRNASeq data obtained with cell ranger.

BAM from cell ranger: SRR15598773.lite.1.146897773 16 chr1 28826 3 91M * 0 0 GGAGTTACAAGCCTCGCTGTAGGCCCCGGGAACCCAACGCGGTGTCAGAGAAGTGGGGTCCCCTACGAGGGACCAGGAGCTCCGGGCGGGC ??????????????????????????????????????????????????????????????????????????????????????????? NH:i:2 HI:i:1 AS:i:89 nM:i:0 RE:A:I xf:i:0 CR:Z:GAACACTTCGAGTACT CY:Z:???????????????? CB:Z:GAACACTTCGAGTACT-1 UR:Z:GCCAAGAAGGGC UY:Z:???????????? UB:Z:GCCAAGAAGGGC RG:Z:scRNASeq:0:1:unknow_flowcell:0

BAM from MAESTER data: SRR15598774_259329 0 chrM 680 255 240M * 0 0 TTAGTAAGATTACACATGCAAGCATCCCCATTCCAGTGAGTTCACCCTCTAAATCACCACGATCAAAAGGGACAAGCATCAAGCACGCAGCAATGCAGCTCAAAACGCTTAGCCTAGCCACACCCCCACGGGAACAAGCAGTGATTAACCTTTAGCAATAAACGAAAGTTTAACTAAGCTATACTAACCCCAGGGTTGGTCAATTTCGTGCCAGCCACCGCGGTCACACGTTTAACCCAA FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFF:FFFFFFFFFFFFFFF,FFFFFFFFFFFF:F,,:FFF:FFFFFFF:FFFF,FFFFF:FFFFFFFFFFF,FFF:FF,FFFF,FF:F,FF,F:FFFFF,,F,FFF:FFFFFFFF:,FFFF:F:F,:,F:,F,:,FFF,, NH:i:1 HI:i:1AS:i:228 nM:i:5 CB:Z:TGAGGTTAGCAACCAG-1 UB:Z:TGTAGCGAGTGA

Nevertheless, the pipeline still stops when running maegatk.

Please find the log file attached.

Edited: I opened a new issue in maegatk because the tool fails with the test data so I assume the problem is now only related to maegatk and not to the previous steps https://github.com/caleblareau/maegatk/issues/11

Any help would be much appreciated.

Best regards,

Sheila

maester.log

petervangalen / MAESTER-2021

Read mismatch ID message and error in code #4

FFFFFFFFFFFFFFFFF,FFFF:FF,FFFFFFFFFFFFFFFFFF:FFFFFFFFFFFF:FFFFFF:FFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFF,FFF,FFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFF:FFFFFFFF::,FFFFFFFFF:FFF:F:F,,FFFF,F,F,,,:F,:FFFF::F:,,,,F:,F:F:FFF::FFFF,FF:F:F,:FFF:,