odomlab2 / sci-rocket

Snakemake workflow for (pre-)processing sci-RNA-seq3 data
MIT License
3 stars 3 forks source link

PE 150 FastQ reads #15

Closed plhm closed 8 months ago

plhm commented 9 months ago

Hi there, J0bbie.

Thank you for such a nice pipeline. I was recently trained in the Odom Lab on how to make the sciSeq library prep with Lauren, and finally got the sequence data back.

The issue: we hired sequencing from a contractor (Novogene). They - as well as other companies that do NGS sequencing like Genewiz - have a policy that they only share BCL files for full lanes, and that they only do custom made sequencing if one purchases a full flow cell.

The situation: I have fastq files from PE 150bp reads.

I noticed that your pipeline takes in BSL files, and that it expects the R1 to have ~ 48bp in length. My issue / question is whether you have plans on:

  1. Adding an option in your pipeline that would take in either FastQ files, rather than BSL files, 2, Take in BSL / FastQ files with > 48bp.

Tbh, if you make option #1 available, then it is easy enough to trim R1 in the FastQ and proceed with the pipeline as it is.

Thanks much for your time.

P

J0bbie commented 9 months ago

Hi Pietro,

Sure, I can also make the barcode locations dynamic (definable in the config) if they are located somewhere else within the read. We convert the R1-sequence ourselves to a 48bp "cellular-barcode" + UMI sequence for downstream tagging with STARSolo; so as long as we can find all the barcodes, it should play nicely with all downstream scripts.

The only thing that needs to be checked if starting from the .fastq.gz is whether the P7+P5 barcode sequences are already in the read-name, like this:

@READNAME 1:N:0:CCGTATGATT+AGATGCAACT
               |----p7---|+|----p5----|: p5 is reverse-complemented during demuxxing.

Could you send me an example of an R1 read with the barcodes labeled, like so: https://odomlab2.github.io/sci-rocket/overview_methodology/#sample-demultiplexing-without-hashing

plhm commented 9 months ago

Hey, J0bbie.

Sorry I went MIA for the last couple of days, I'm back on the grid now.

Here is what I have for the first 'good' read in the file:

@LH00328:116:22GHJ3LT3:3:1101:1139:1048 1:N:0:AGAGGAGAAT+AAGATCCAAC

Which matches your description. I have also gone in, RC the primer sequences, and they do match what you described, with the p5 being the RC. The key difference, and honestly my main concern is the following (and here goes the actual DNA sequence for that same read):

| GNCGTAACT | CAGAGC | AAGTCTTA | TGGTCAGCCA | TTTTTTTTTTTTTTTTTTTT | ACGGTTCTGCACCTAATATTTATTAAAAAAATACAAAACTTCAAAAAGTTTCCATAGGAAAGCGATATGAATGTGCTATCCCTGTCTCTTATACACATNAGGCTGTCTCTTATACACATCTCCGAGCCCACGAGACGCTAACGGATATCTAGTGGGGCGGTTTCTGCTTGGGGAGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGNGGGGGGGGGGGGGG |

Where:

| Ligation | Primer | UMI | RT | RC PolyA | Presumed gibberish |

Is this what you are looking for? If the code deals well with those longer reads, than this wouldn't be a problem.

As a side note: I was able to get my hands at a BCL file. Having said that, I still believe that having code that would take in FastQ files would be beneficial for those of us that don't always have access to BCLs. Also, my BCLs will still have R1s > 48nt, so my concern about that remains.

Thanks much for responding so quickly, I much appreciate it.

J0bbie commented 9 months ago

I've removed some of the checks which required the R1 sequence to be a specific length. It seems like the R1 sequence is indeed the same in regards to the barcodes, so it should work as-is.

I just need to add some more support/documentation for starting from .fastq.gz files. I'm also fixing a lot of other issues / improvements (speed + accuracy) and will also add this to the new release (next week).

plhm commented 9 months ago

Hey, J0bbie. Sweet.

I've been banging my head against the wall a bit, and was wondering if I could give you some feedback as an outside user. If you're okay with that, could you let me know the best way to do that?

Also, am I correct in understanding that I in the 'rt' column within the samplesheet, I cannot have multiple 'rt' for the same sample?

Let me explain my situation. In our case, we had one sample (Danio rerio) that we distributed across multiple wells in a 96-well plate (P01-A01 through P01-H03, to be more precise). When I tried running the pipeline using the following setup:

path_bcl sequencing_name p5 p7 rt sample_name species n_expected_cells /project/kyathit/RawData/231208_sciseq-Pilot/usftp21.novogene.com/20231128_lh00328_0116_A22GHJ3LT3-L3 rerio_sci_test A11,B11,C11,D11,E11,F11,G11,H11 B01:B12 P01-A01:P01-H03 rerio_01 Drerio 6000

I got the following message:

[12/15/23 14:05:08] ERROR sci-rocket: Sanity check (Sample sheet) - Contains RT barcodes which are not defined in the barcodes file: P01-A01:P01-H03

Does this mean that I will have to put in each well as its own 'sample' in the samplesheet?

Thanks much.

J0bbie commented 9 months ago

Hi Pietro,

Good point, hadn't added the RT as also being dynamic (similar to p5/p7) if the same p5/p7 indexes are used. I'll add it to the next update this week.

Thanks for letting me know!

Best,

Job

J0bbie commented 9 months ago

I've added the dynamic RT (rt) and path to previous fastq (path_fastq_bcl) as input into the sample-sheet. See documentation for more information.

Let me know if this solves it for you!

Best,

Job

plhm commented 9 months ago

Hey, J0bbie. Thanks for all the updates. I'll give it a shot today and will keep you posted.

plhm commented 9 months ago

Alright, J0bbie. I was able to push through the pipeline up until demux_rocket.py . I am using the latest version of the software, I believe (I cloned the repository today, 2023-12-21). However, when running demux_rocket.py, I received the following error message:

2023-12-21 16:16:07 - (sci-rocket) INFO: Starting sample-based demultiplexing of rerio_sci-test: (R1) rerio_sci-test/raw_reads_split/R1_1-of-10.fastq.gz (R2) rerio_sci-test/raw_reads_split/R2_1-of-10.fastq.gz R1 sequence is not 34 nucleotides (LH00328:116:22GHJ3LT3:3:1101:1028:1048)!

Perhaps this was a secondary length check you missed after your first fix?

The conditional for that filter is at sciClasses.py.

     45         if len(self.read1.sequence) != 34:
     46             print("R1 sequence is not 34 nucleotides (%s)!" % self.read1.name)
     47             sys.exit(1)

That's very nice code you have there, J0bbie. I'm learning a lot from this. ^^ I'm guessing that by changing sciClasses and loading it as is in demux_rocket.py could do the trick?

-- Update --

Seems to be working. Just commented out the __validateR1(self) call in sciClasses.

J0bbie commented 9 months ago

Sorry about that one. It sneaked in during the refactoring. Also removed those lines from the script now.

plhm commented 9 months ago

NP. I'll close this one once I run the pipeline with the fastq files, is that okay with you? I should do this once I'm able to troubleshoot the current run. : )

plhm commented 9 months ago

Hey, J0bbie. Unfortunately I did end up getting results that don't make sense in the pipeline.

In short, my R1s end up with a bunch of 'G's because there is no more sequence after the basepair 70 or so (remember that I ran a 150bp PE sequence run). As a result, my final QC file had the 'rt' "GGGGGGGGGG," as being disproportionately represented (like > 80% of the time). I went in a quest and found the culprit. I reasoned that for the 'rt' barcode, rather than walking forward in the read, you might have walked backwards, which would have resulted in you looking at the last 10 or 11 basepairs at the end of a 150bp read, rather than position 22ish through 34ish in that same read. Going through your code I found the following under sciClasses.py:

             # Retrieve the RT barcode from R1 (last 10 bp).
             self.rt_sequence = self.read1.sequence[-10:]
         else:
             # Retrieve the RT barcode from R1 (last 10 bp, minus one).
             self.rt_sequence = self.read1.sequence[-11:-1]

Would you agree that this is causing the issue? If so, I'd suggest you change this guy a bit so that it can look for those barcodes on longer reads going from position 0 onwards, rather than length backwards.

Please let me know if this makes sense and I'll re-run the pipeline with your modifications.

------------------- Update --------------------

In the meanwhile, I modified the code, and I am running it as:

        if self.ligation_status == None or len(self.ligation_sequence) == 10:
            # Retrieve the RT barcode from R1 (last 10 bp).
            self.rt_sequence = self.read1.sequence[24:33]
        else:
            # Retrieve the RT barcode from R1 (last 10 bp, minus one).
            self.rt_sequence = self.read1.sequence[23:32]

------------------- Update 2023-12-26 --------------------

The pipeline ran all the way through. I am now focusing on some issues regarding the P7s and the P5s.

In short, it seems to be true now that when there’s the index1 (p7) as GGGGGGGG (i.e., no signal, which matches the end of an R1 reads with 150bp), the index2 (p5) is AGATCTCGGT.

plhm commented 8 months ago

J0bbie, hello!

I finally had the time to try to run the pipeline using the fastq-only. It unfortunately did not run. It threw an error saying that 'path_bcl' was not found. Do note that I ran the sample sheet with the following configuration:

path_bcl_fastq experiment_name p5 p7 rt sample_name species n_expected_cells /path/to/fastq sci-test A11:H11 B01:B12 P01-A01:P01-H03 rerio zfish 12000 /path/to/fastq sci-test A11:H11 B01:B12 P01-A07:P01-H09 albo zfish 12000 /path/to/fastq sci-test A11:H11 B01:B12 P01-A04:P01-H06 aesc zfish 12000 /path/to/fastq sci-test A11:H11 B01:B12 P01-A10:P01-H12 guppy guppy 12000

Withouth the "path_bcl" column in the file.

I was able to, however, hack the program. Move the files into a 'raw_reads' folder within the expected path, changed the name to that which the program expects, and it picked it up from there.

J0bbie commented 8 months ago

Hi Pietro,

Sorry about that, I missed a line where it was still requiring the path_bcl. If there is now the column path_fastq (with or without path_bcl), it will now correctly symlink those fastq files and start from there.

Regarding the RT sequence, I've indeed taken over your suggestion instead of searching backwards from the end.

plhm commented 8 months ago

Sweet! I think we're good to close this, then. I've done all the tweaks in my local version of the code, and it is working nicely. : ) Thanks for such a nice pipeline.