mckennalab / SingleCellLineage

Updated scripts and pipelines for processing GESTALT data at single-cell resolution

Blank .stats files? #6

Closed oligomyeggo closed 3 years ago

oligomyeggo commented 3 years ago

Hi Aaron, I am running through the 10X version of your docker container using raw data files from Bushra Raj's latest 2020 paper in Neuron. For the reference files I am using DsRed.fa, DsRed.fa.cutSites, and DsRed.fa.primers, all of which Dr. Raj sent me after I asked her for the files she used to process the data in her 2020 paper. I am starting small with just one sample provided on GEO, as my intention isn't to repeat those experiments but just to get your pipeline up and running in preparation for our own 10X experiment. While the script completes successfully and finishes all jobs, I am getting back blank .stats files (as in, besides the header, they are totally empty) and am wondering how I can troubleshoot this. My guess is that I might have set something up wrong in either the tear_sheet.txt file or the test_run.sh script, though I tried to keep them as close to the example provided in the README.md as possible.

If it's helpful, I used both runs from here for my fastq files: https://www.ncbi.nlm.nih.gov/sra?term=SRX9142639 and I have attached a zip file with the three reference files I used, my tear_sheet.txt file, and my test_run.sh script. Maybe there is something obvious that I overlooked that you can see without having to run anything. Thanks for any help or insight you might be able to provide!

10X_scGestalt_test.zip

aaronmck commented 3 years ago

Hi OME,

I hadn't processed this data before, so it took me a minute here; sorry for the delay. Given your tearsheet (I just ran the first sample) and your shell script I was able to get 5678 UMIs with 4206 passing. I'm not sure how Bushra ran the data, so I don't know if this matches up with her results, but something is coming out. If you're getting zero UMIs (empty stats), some obvious things to check:

1. The read1 and read2 files could be swapped (I can't see your read files). A quick check is to look at the first few bases of the reads and see if they line up with the expected primers; see the sketch below.
2. Take a look at the UMI counts file (like zBr15dpf8.umiCounts) and see if reads are getting through that step.
3. Check that the docker is up to date and you're running ":latest".
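A minimal sketch of that first spot-check, assuming gzipped FASTQs and one primer sequence per line in the .primers file (the filenames here are placeholders; swap in your own):

```bash
# Print the first few sequence lines of each read file
zcat read1.fastq.gz | awk 'NR % 4 == 2' | head -n 5
zcat read2.fastq.gz | awk 'NR % 4 == 2' | head -n 5

# Compare against the expected primers
cat DsRed.fa.primers
```

If the primer doesn't show up at the start of the read you expect it in, the files are likely swapped in the tearsheet.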

oligomyeggo commented 3 years ago

No problem! I appreciate you taking a look!

I am not sure what's going on, but I am still getting empty stats. I checked my read1 and read2 files, and made sure that the sequence data (which is read2) is in the first FASTQ slot in the tearsheet and the 10X barcode/UMI (read1) is in the second FASTQ slot. Then I checked the sequence data (read2) and confirmed that the reads do line up with the expected primers (i.e., the primers contained in the reference.fa.primers file can both be found in the sequence data, on opposite sides of the reads).

The reads are getting to the UMI counts file step, but when I looked at my zBr15dpf8.umiCounts file none of the counts are passing and they are all under the missingPrimer1 column (quick tally below), so I wonder if something weird is going on there on my end? I tried changing the --primersToUse parameter of the shell script I sent you; it runs using FORWARD and BOTH, but that returns empty stats files. If I try to run it using NONE then the script fails.

(A side note about the shell script parameters: --umiLength defaults to 28, I am guessing based on 10X v3 chemistry with a 16 bp cell barcode and a 12 bp UMI. However, the data I am working with from Raj et al. 2020 used v2 chemistry with a 16 bp cell barcode and a 10 bp UMI. I tried changing --umiLength to 26 to reflect this, but that caused the script to fail.)

And finally, I confirmed that the docker is up to date and I am running ":latest" (I had even re-downloaded it last week, just to be sure).
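In case it's useful, this is roughly how I tallied the umiCounts columns (this assumes the file is a plain tab-delimited table with a header row; the column number for missingPrimer1 is just an example and will depend on your file):

```bash
# List the header columns with their positions
head -n 1 zBr15dpf8.umiCounts | tr '\t' '\n' | cat -n

# Count the values in a given column, e.g. column 5 if that is missingPrimer1
cut -f 5 zBr15dpf8.umiCounts | tail -n +2 | sort | uniq -c
```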

So, I am not sure what is going wrong on my end and why I am not getting passing UMI counts, especially since you do and we are using the same files/scripts. Seems like the issue has something to do with the UMI counts not passing for whatever reason on my end. Do you have any suggestions for how I can start to narrow in on the counts issue and see if I can track down the problem?

Thank you again for your patience and time!

aaronmck commented 3 years ago

Just to check I grabbed another set (SRR12661668) and ran that, and it seemed to work (27440/27512 passing). I initially split this SRA with `fastq-dump -I --split-files SRR12661668`. My tearsheet has SRR12661668_3.fastq.gz as read 1 and SRR12661668_2.fastq.gz as read 2. For a check here, my first read 1 is:

```
@SRR12661668.1.3 M00851:329:000000000-BM547:1:1102:15683:1339 length=260
GCTGCTTCATCTACAAGGTGAAGTTCATCGACGGCCCCGTGATGCAGAAGAAGCCACTACCTGGTGGAGTTC
```

and my first read 2 is:

```
@SRR12661668.1.2 M00851:329:000000000-BM547:1:1102:15683:1339 length=26
CGTCCATTCGAATGCTTAACAGTGTG
```

I'm also running with `--umiLength 26` for these samples. Maybe we can try with this sample and see if you get the same thing?
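In case it helps reproduce that check, this is roughly how those first records can be pulled out (assuming the split files were gzipped after fastq-dump; drop the `zcat` for plain .fastq):

```bash
# First record (header + sequence) of each mate
zcat SRR12661668_3.fastq.gz | head -n 2
zcat SRR12661668_2.fastq.gz | head -n 2
```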

oligomyeggo commented 3 years ago

Unfortunately it is still not working. I split the SRR12661668 file using the same command as you and set up my tearsheet as you described. I checked my first reads for each file, and have what you listed above. I ran the shell script using --umiLength 26. So, everything looks like it should be exactly the same. However, when I ran it on my end the script failed. I see an error in FunctionEdge which says:

```
Contents of /app/my_test_run/data/pipeline_output/SRR12661668/SRR12661668.umi.fwd.fastq.out:
Reading in sequences and parsing out UMIs (one dot per 100K reads, carets at 1M): ....Killed
```

and then later it looks like QGraph failed as well. Additionally, I am seeing that, despite setting --umiLength 26, it looks like something might still be calling a default of 28 somewhere (maybe in the beths_UMI_to_standard_UMI.scala script?). Is it possible that while the Docker container might have been updated, some of the code wasn't? (I am not super familiar with Docker, so I am not sure if that's a possibility.)

I am attaching a folder that contains: SRR12661668_test.txt, which just confirms we have the same first reads for read 1 and read 2; SRR12661668_tear_sheet.txt, to confirm the order of the read files and parameters; test_run.txt, to confirm parameters; and docker_output.txt, just in case there is something in there that makes more sense to you than to me.

SRR12661668_test.zip

Thank you!

aaronmck commented 3 years ago

Ahh gotcha; this is a little different: I'm guessing you ran out of memory. Usually `....Killed` at the end of `(one dot per 100K reads, carets at 1M):` happens when the UMI collapsing step exceeds the memory allotted to the docker container and the process gets killed, hence the `Killed` bit. Do you know how much memory your docker container gets? You can also try cutting-and-pasting the failed command onto the Docker container's command line and running it to see what happens (sketch below).
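A quick sketch of how to check and raise the limit (these are standard Docker flags, but your exact run command and image name will differ, so treat those as placeholders):

```bash
# How much memory the Docker daemon can hand out in total
docker info | grep -i "total memory"

# The memory limit on a running container (0 means no explicit limit)
docker inspect --format '{{.HostConfig.Memory}}' <container_id>

# Re-run the container with a higher explicit limit, e.g. 5 GB
docker run --memory=5g -it <image>:latest bash
```

On Docker Desktop (Mac/Windows) the ceiling is set in Preferences > Resources rather than per-container.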

The 28 in the previous step should be more flexible as well, though it won't matter here. It's just telling the script to take up to 28 bases of the 10X barcode read, which is only 26 in this data anyway. It was put in because some people sequence past the set 10X barcode length on that read. It doesn't affect the length of the downstream UMI calling.

oligomyeggo commented 3 years ago

Yes!!! That fixed it! My docker container was only set to 2 GB of memory. I adjusted that to 5 GB and then everything ran as expected. I also got 27440/27512 passing. Thank you so much for all of your help troubleshooting this! I really appreciate it.

aaronmck commented 3 years ago

Awesome, glad it worked! Let me know if you run into any other issues, though I'll close this for now.