rvolden / C3POa

Computational pipeline for calling consensi on R2C2 nanopore data
GNU General Public License v2.0

C3POa_preprocessing halts when parsing splint blat alignments #5

Closed claumer closed 5 years ago

claumer commented 5 years ago

Hi there,

I've been trying to run C3POa to error correct some RCA data (albeit using a custom protocol not R2C2 - thank you for writing a program flexible enough to accommodate new splint sequences).

I think the program installed correctly and ran normally - it creates a "R2C2_temp_for_BLAT.fasta" and a "Splint_to_read_alignments.psl", the latter of which has 349408 lines - so we are clearly finding my custom splint sequence all over the place as expected.

However, the program halts after this, throwing the following to stderr:

Traceback (most recent call last):
  File "C3POa_preprocessing.py", line 214, in <module>
    main()
  File "C3POa_preprocessing.py", line 209, in main
    adapter_dict = parse_blat(output_path)
  File "C3POa_preprocessing.py", line 148, in parse_blat
    adapter_dict[read_name][strand].append((adapter, float(a[0]), position))
KeyError: '495d8ff0-8c89-4f54-bca7-9344e74a1951'

This is the name of the first read in the Splint_to_read_alignments.psl file. So I'm reading this as a problem with the parsing? The .psl has this format (first 10 lines):

64 1 0 0 0 0 1 2 - 495d8ff0-8c89-4f54-bca7-9344e74a1951 1358 545 610 Splint_TruSeq 67 0 67 2 12,53, 748,760, 0,14,
65 0 0 0 1 1 1 2 + 71b57cbd-a089-456d-8822-2ff160f5c5fc 1066 275 341 Splint_TruSeq 67 0 67 3 7,42,16, 275,283,325, 0,7,51,
25 0 0 0 0 0 1 3 - 71b57cbd-a089-456d-8822-2ff160f5c5fc 1066 798 823 Splint_TruSeq 67 35 63 2 7,18, 243,250, 35,45,
26 0 0 0 1 41 1 41 + 22de2a2c-2a66-4fb9-a79b-7ab146ef45a2 1208 355 422 Splint_TruSeq 67 0 67 2 13,13, 355,409, 0,54,
65 1 0 0 1 1 1 1 - 22de2a2c-2a66-4fb9-a79b-7ab146ef45a2 1208 355 422 Splint_TruSeq 67 0 67 3 17,8,41, 786,804,812, 0,17,26,
58 0 0 0 1 4 2 9 + 9c6466b8-df94-44cf-9d2d-102aff9e1e51 1031 364 426 Splint_TruSeq 67 0 67 3 28,12,18, 364,396,408, 0,35,49,
26 0 0 0 1 36 1 41 - 9c6466b8-df94-44cf-9d2d-102aff9e1e51 1031 364 426 Splint_TruSeq 67 0 67 2 13,13, 605,654, 0,54,
60 1 0 0 0 0 3 6 + b9318330-dbb3-4aff-8a22-fe48469af700 1047 414 475 Splint_TruSeq 67 0 67 4 10,23,10,18, 414,424,447,457, 0,13,37,49,
59 0 0 0 3 4 4 8 + 7e872ed6-c0cb-4cf1-a9e1-8f2f7e9e50fc 1298 1154 1217 Splint_TruSeq 67 0 67 6 7,13,5,9,7,18, 1154,1162,1177,1182,1191,1199, 0,7,21,27,38,49,
59 3 0 0 1 6 1 5 - 67c37574-ba0a-4a3a-aff4-f05cf80a894c 1047 928 996 Splint_TruSeq 67 0 67 2 16,46, 51,73, 0,21,
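For reference, the traceback shows parse_blat splitting each of these records on whitespace. A minimal sketch of unpacking the fields it needs, following the standard headerless BLAT PSL column layout (the variable names here are illustrative, not C3POa's own):

```python
# One headerless PSL record (the first line above); fields are whitespace-separated.
psl_line = ('64 1 0 0 0 0 1 2 - 495d8ff0-8c89-4f54-bca7-9344e74a1951 '
            '1358 545 610 Splint_TruSeq 67 0 67 2 12,53, 748,760, 0,14,')

a = psl_line.split()
matches   = float(a[0])            # matching bases (the score in the traceback's float(a[0]))
strand    = a[8]                   # '+' or '-'
read_name = a[9]                   # query name: must match the fastq read name exactly
read_len  = int(a[10])             # query (read) length
q_start, q_end = int(a[11]), int(a[12])  # alignment span on the read

print(read_name, strand, matches, read_len, q_start, q_end)
```

The read name in column 10 (index 9) is the key blat writes, which is why any mismatch with the fastq header keys surfaces as the KeyError above.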

Would be grateful for your help in figuring out what's going wrong - I'm really excited to have a crack at these data.

Regards, Chris L

EDIT - I'm running python 3.6.8 as part of the Anaconda distribution, if that helps at all.

rvolden commented 5 years ago

Would you be able to send me your input fastq file? I can't reproduce this error on any test data that I have. You can email it to rvolden@ucsc.edu

Edit: also send your splint sequence.

claumer commented 5 years ago

Thanks Roger. I've sent you some test data.

A very similar error occurs for me when I try C3POa on the first 10,000 reads from your B-cell dataset, using the R2C2 splint:

Traceback (most recent call last):
  File "C3POa_preprocessing.py", line 214, in <module>
    main()
  File "C3POa_preprocessing.py", line 209, in main
    adapter_dict = parse_blat(output_path)
  File "C3POa_preprocessing.py", line 148, in parse_blat
    adapter_dict[read_name][strand].append((adapter, float(a[0]), position))
KeyError: 'SRR6924616.16'

So I would reckon that this is a more general problem, perhaps something related to my specific python environment (which otherwise performs very nicely for me).

claumer commented 5 years ago

A little update --

I had a look at the C3POa_preprocessing.py code - it seems that there was a problem with inconsistent read names between the blat output (which contained only the first part of the read name, not the other metadata) and what was being loaded into memory (the full fastq header). I hackishly made some edits to focus only on the read name sensu stricto:

line 126: name = line[1:].strip().split()[0]  # CEL: added .split()[0]

line 155: name, sequence, quality = read.split()[0], reads[read][0], reads[read][1]  # CEL: added .split()[0]
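The mismatch can be reproduced in isolation: blat writes only the first whitespace-separated token of the fastq header as the query name, so a dictionary keyed on the full header will never contain the PSL name. A minimal sketch (the header metadata fields below are made up for illustration):

```python
# A nanopore fastq header carries metadata after the read ID, but blat's PSL
# output keeps only the ID itself, so keying the read dict on the full header
# reproduces the KeyError from the traceback.
header = '@495d8ff0-8c89-4f54-bca7-9344e74a1951 runid=abc123 ch=42 read=7'

full_name = header[1:].strip()                       # old behaviour: ID plus metadata
psl_name = '495d8ff0-8c89-4f54-bca7-9344e74a1951'    # what blat writes to the .psl

reads = {full_name: ('ATCG', '!!!!')}
# reads[psl_name]  # would raise KeyError, exactly as in the traceback

name = header[1:].strip().split()[0]                 # the fix: keep only the first token
print(name == psl_name, name in reads)
```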

And now it seems to work - creating numbered subdirectories in a splint_reads folder, each holding an 'R2C2_raw_reads.fastq' file of, I assume, 4000 reads. I assume one would either need to cat these together before running C3POa.py, or run a separate instance of C3POa.py in each subfolder.

Running C3POa.py on a test subset gives its own interesting set of python errors:

[claumer@hh-yoda-08-01 C3POa]$ python C3POa.py -p ./ -m /nfs/leia/research/marioni/claumer/C3POa/NUC.4.4.mat -c paths.txt -l 1000 -d 500 -o R2C2_Consensus.fasta -r R2C2_raw_reads.fastq
rm: cannot remove './/tmp1': No such file or directory
R2C2_raw_reads.fastq
/nfs/leia/research/marioni/claumer/anaconda3/lib/python3.6/site-packages/numpy/core/fromnumeric.py:2920: RuntimeWarning: Mean of empty slice.
  out=out, **kwargs)
/nfs/leia/research/marioni/claumer/anaconda3/lib/python3.6/site-packages/numpy/core/_methods.py:85: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
.//tmp1/7c8cab07-64a2-4196-9da6-42b9dd4666c9_consensus_1.fasta
.//tmp1/fadb7f9d-11cc-4fe2-ab9a-2cd0dee3e0a8_consensus_1.fasta
.//tmp1/d4718a9b-ea5a-407e-8b17-1902b2b0b5fa_consensus_1.fasta
etc etc

And leaves a lot of cluttered files in the working directory:

-rw-r--r-- 1 claumer marioni 671826 Feb 8 15:28 subreads.fastq
-rw-r--r-- 1 claumer marioni   1052 Feb 8 15:28 seq2.fasta
-rw-r--r-- 1 claumer marioni   1052 Feb 8 15:28 seq1.fasta
-rw-r--r-- 1 claumer marioni    143 Feb 8 15:28 racon_messages.txt
drwxr-sr-x 2 claumer marioni   8192 Feb 8 15:28 tmp1
-rw-r--r-- 1 claumer marioni    772 Feb 8 15:28 minimap2_messages.txt
-rw-r--r-- 1 claumer marioni    197 Feb 8 15:28 poa_messages.txt
-rw-r--r-- 1 claumer marioni 199053 Feb 8 15:27 R2C2_Consensus.fasta
-rw-r--r-- 1 claumer marioni    383 Feb 8 15:11 paths.txt

But it does give me an R2C2_Consensus.fasta file which at first glimpse seems like it might contain error corrected sequences.

But I thought I would check in first: are the numpy warnings (and the rm: cannot remove './/tmp1': No such file or directory message) I cite above any cause for concern? And should these other temporary files indeed be left lying around?

rvolden commented 5 years ago

I've pushed the changes necessary for the key error to stop happening (definitely my fault, just changed the fastq parser so that it ignores the metadata).

As for the C3POa errors, those are all normal. At the beginning it tries to get rid of a directory in case you run multiple times. The numpy errors don't really mean much, and don't impact functionality.
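The harmlessness of those numpy messages can be checked in isolation: taking the mean of an empty slice emits a RuntimeWarning and evaluates to nan rather than raising, so it only matters if the caller actually uses the resulting value.

```python
import warnings
import numpy as np

# np.mean over an empty array triggers the "Mean of empty slice" RuntimeWarning
# seen in the C3POa output and returns nan instead of raising an exception.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    result = np.mean(np.array([]))

print(result)  # nan
print(any('Mean of empty slice' in str(w.message) for w in caught))
```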

As for the temporary files, I use them for debugging, particularly the *_messages.txt files. We use subreads.fastq in other analyses. The only real junk files are the seq1/2.fasta files which are used in the alignment. I could get rid of those, but I use them to check up on things once in a while.

claumer commented 5 years ago

Awesome, thanks for that.

It does indeed seem to be working. A little bit of extra documentation about how to read what's added to the headers, how many subreads were assigned to each consensus read, etc, would be really useful, if you feel so motivated :-)

One other thing - it seems that most of the reads really are legitimately corrected, which is great... but I do notice that quite a few larger reads make it into the R2C2_Consensus.fasta which are clearly still uncorrected concatemers. Perhaps this has something to do with the q-score threshold (I picked 8, which I suppose is going to let some dodgy reads in)... but do you have any suggestions on how to toy with the sensitivity/specificity of splint detection beyond this general read quality filter?
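For reference, a read-level q-score threshold like the one mentioned above can be sketched as an average Phred score over the fastq quality string (Sanger offset 33). This is a simple arithmetic mean, not necessarily the exact statistic C3POa computes; the helper name and threshold are illustrative:

```python
def mean_qscore(quality_string, offset=33):
    """Arithmetic mean of per-base Phred scores decoded from a fastq quality
    string (simple average, not an error-probability-weighted mean)."""
    scores = [ord(c) - offset for c in quality_string]
    return sum(scores) / len(scores)

# '5' encodes Phred 20 and '#' encodes Phred 2 at offset 33, so with a
# threshold of 8 the first read passes and the second is filtered out.
threshold = 8
print(mean_qscore('5555') >= threshold)  # Q20 read passes
print(mean_qscore('####') >= threshold)  # Q2 read is rejected
```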

rvolden commented 5 years ago

I've added documentation to the readme on how to read the headers after preprocessing and consensus calling.

I think another possibility is that it's having trouble because your splint is really short. This will also depend on your insert length, but generally the way we smooth over the scores depends on a sliding window, and it's hard to call peaks that occur more often than every ~450 bases. It's also much harder to get accurate splint alignments when your splint is ~65 bases as opposed to ~200. I'm going to play around with making figures to see if I can get this to work with your data.
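The sliding-window constraint described above can be illustrated with a toy score profile (the window size and peak-calling heuristic here are illustrative, not C3POa's actual parameters): two splint hits closer together than roughly one window width blur into a single bump after smoothing, so only one peak gets called.

```python
import numpy as np

def smooth(scores, window=41):
    """Moving average via convolution with a flat kernel; mode='same'
    preserves the profile length. Window size is illustrative."""
    kernel = np.ones(window) / window
    return np.convolve(scores, kernel, mode='same')

def count_hits(profile, frac=0.6):
    """Crude peak caller: count contiguous regions above frac * max."""
    above = profile > frac * profile.max()
    return int(above[0]) + int(np.sum(np.diff(above.astype(int)) == 1))

# Two sharp splint-score spikes only 30 bases apart, well under the window width.
raw = np.zeros(300)
raw[[130, 160]] = 50.0

print(count_hits(raw))          # the raw profile shows two separate hits
print(count_hits(smooth(raw)))  # after smoothing they merge into one
```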

Edit: after looking at these files a bit more closely, it's definitely a splint thing. Of the 1000 reads you gave me, you lose about 60% after preprocessing. This just means that blat couldn't find a splint in those sequences. I'm not really sure what you can do about this besides play around with blat parameters.

claumer commented 5 years ago

Hi Roger,

Well - this is pilot data, so it's not surprising that there are some kinks still to work out. It is possible to increase the splint length, to perhaps ~120 bp... there are a few downsides but if the efficiency of splint detection improves, perhaps it will be a worthwhile trade. I played around a little with blat parameters in the hardcoded section and it made a marginal difference at best...

Thanks for the input, and the extra documentation.

-CEL