Best way to process Fastq files from SRA

ckr123tw commented 1 month ago

Hi, thank you for developing this great tool. I have been trying to run TRUST4 on all 10X data downloaded from the NCBI Sequence Read Archive (SRA). Things have gone smoothly when the libraries were sequenced according to the 10X recommendation (26-28 bp R1, 90-98 bp R2).

For FASTQ files in which both reads are the same length (usually 150 bp for both R1 and R2), the cell barcodes do not seem to be at the 5' end. The FastQC report of an example is attached. I suspect that these are TruSeq adapters, and a preliminary examination of the reads seems to support this.

What would be the best practice for telling TRUST4 where the barcode is in these files? I have been using the per-base sequence content plot to do so. Removing adapters with a tool like cutadapt, using known TruSeq adapter sequences, is another option, but 10X generally advises against trimming adapters. I am curious how you would proceed with these data.

mourisl commented 1 month ago

Do you mean the barcode and UMI sequences are not at a fixed position in the read? Could they be at the 3' end of the read? I think you can also use the barcode whitelist to confirm the positions.

ckr123tw commented 1 month ago

Different studies have the barcodes at different positions in the reads. Using the whitelist is a great idea! I will test this solution to confirm the positions.
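
For reference, the whitelist check suggested above could be sketched roughly as follows. This is not part of TRUST4; it is a minimal illustration that assumes 16 nt barcodes and a plain-text whitelist with one barcode per line. The function names (`whitelist_hits_by_offset`, `read_fastq_sequences`, `load_whitelist`) are hypothetical. The idea is to slide a barcode-length window along a sample of reads and count, for each offset and strand, how often the window matches a whitelisted barcode exactly. A single dominant offset suggests where the barcode sits.

```python
import gzip
from collections import Counter

BARCODE_LEN = 16  # assumption: 16 nt cell barcodes; adjust for other chemistries

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def revcomp(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def load_whitelist(path):
    """Load a whitelist file (one barcode per line, optionally gzipped)."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return {line.strip() for line in handle if line.strip()}


def read_fastq_sequences(path, max_reads=10000):
    """Yield the sequence line of up to max_reads FASTQ records."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        for i, line in enumerate(handle):
            if i // 4 >= max_reads:
                break
            if i % 4 == 1:  # second line of each 4-line record
                yield line.strip()


def whitelist_hits_by_offset(reads, whitelist, barcode_len=BARCODE_LEN):
    """Count exact whitelist matches for every (offset, strand) window."""
    hits = Counter()
    for seq in reads:
        for offset in range(len(seq) - barcode_len + 1):
            kmer = seq[offset:offset + barcode_len]
            if kmer in whitelist:
                hits[(offset, "+")] += 1
            elif revcomp(kmer) in whitelist:
                hits[(offset, "-")] += 1
    return hits


# Toy data standing in for real reads and a real whitelist.
toy_whitelist = {"AAACCCAAGAAACACT", "AAACCCAAGAAACCAT"}
toy_reads = [
    "GGG" + "AAACCCAAGAAACACT" + "TTTTTTTT",
    "CCC" + "AAACCCAAGAAACCAT" + "ACGTACGT",
]
toy_hits = whitelist_hits_by_offset(toy_reads, toy_whitelist)
print(toy_hits.most_common(3))
```

On real data, one would pass `read_fastq_sequences("R1.fastq.gz")` and `load_whitelist(...)` instead of the toy inputs (both file names here are placeholders). Only exact matches are counted, so sequencing errors will lower the hit rate, but the dominant offset should still stand out if the barcode position is fixed.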

liulab-dfci / TRUST4

Best way to process Fastq files from SRA #311