rhpvorderman / sequali

Fast sequencing data quality metrics
GNU Affero General Public License v3.0
12 stars 0 forks source link

ValueError: not enough values to unpack #96

Closed OZTaekOppa closed 5 months ago

OZTaekOppa commented 9 months ago

Hi,

I've followed the steps you suggested (or explained in the manual). Here's a summary of what I did:

  1. Installed sequali via conda (python 3.8).

  2. Added the ultra-long RA as the only "top strand" and "start" to the adapter_list.tsv file:

    Ligation kit and adapter mix are indistinguishable

    Oxford nanopore ultra-long ligation kit or Rapid Adapter (RA), top strand nanopore TTTTTTTTCCTGTACTTCGTTCAGTTACGTATTGCT start

  3. Commented out all other sequences in adapter_list.tsv except for the ultra-long RA added above.

  4. Executed the following commands:

ASFL='/g/data/te53/hj3792/Tmp_Prgms/miniconda3/envs/trfpy38/lib/python3.8/site-packages/sequali/adapters/adapter_list.tsv'

sequali /ONT_raw_data/OutFQ/reads.fastq --adapter-file "$ASFL" --outdir /ONT_raw_data/OutFQ/ -t 8 sequali /ONT_raw_data/OutFQT1/reads.fastq --adapter-file "$ASFL" --outdir /ONT_raw_data/OutFQT1/ -t 8 sequali /ONT_raw_data/OutFQT2/reads.fastq --adapter-file "$ASFL" --outdir /ONT_raw_data/OutFQT2/ -t 8

  1. Reviewed the log and error files.

^MProcessing reads.fastq: 0%| | 0.00/149G [00:00<?, ?iB/s]Traceback (most recent call last): File "/g/data/te53/hj3792/Tmp_Prgms/miniconda3/envs/trfpy38/bin/sequali", line 11, in sys.exit(main()) File "/g/data/te53/hj3792/Tmp_Prgms/miniconda3/envs/trfpy38/lib/python3.8/site-packages/sequali/main.py", line 151, in main adapters = list(adapters_from_file(args.adapter_file, seqtech)) File "/g/data/te53/hj3792/Tmp_Prgms/miniconda3/envs/trfpy38/lib/python3.8/site-packages/sequali/adapters.py", line 42, in adapters_from_file name, seqtech, sequence, position = line.split("\t") ValueError: not enough values to unpack (expected 4, got 1) ^MProcessing reads.fastq: 0%| | 0.00/149G [00:00<?, ?iB/s] ^MProcessing reads.fastq: 0%| | 0.00/149G [00:00<?, ?iB/s]Traceback (most recent call last): File "/g/data/te53/hj3792/Tmp_Prgms/miniconda3/envs/trfpy38/bin/sequali", line 11, in sys.exit(main()) File "/g/data/te53/hj3792/Tmp_Prgms/miniconda3/envs/trfpy38/lib/python3.8/site-packages/sequali/main.py", line 151, in main adapters = list(adapters_from_file(args.adapter_file, seqtech)) File "/g/data/te53/hj3792/Tmp_Prgms/miniconda3/envs/trfpy38/lib/python3.8/site-packages/sequali/adapters.py", line 42, in adapters_from_file name, seqtech, sequence, position = line.split("\t") ValueError: not enough values to unpack (expected 4, got 1) ^MProcessing reads.fastq: 0%| | 0.00/149G [00:00<?, ?iB/s] ^MProcessing reads.fastq: 0%| | 0.00/149G [00:00<?, ?iB/s]Traceback (most recent call last): File "/g/data/te53/hj3792/Tmp_Prgms/miniconda3/envs/trfpy38/bin/sequali", line 11, in sys.exit(main()) File "/g/data/te53/hj3792/Tmp_Prgms/miniconda3/envs/trfpy38/lib/python3.8/site-packages/sequali/main.py", line 151, in main adapters = list(adapters_from_file(args.adapter_file, seqtech)) File "/g/data/te53/hj3792/Tmp_Prgms/miniconda3/envs/trfpy38/lib/python3.8/site-packages/sequali/adapters.py", line 42, in adapters_from_file name, seqtech, sequence, position = line.split("\t") ValueError: not enough values to unpack (expected 4, got 1) ^MProcessing reads.fastq: 0%| | 0.00/149G [00:00<?, ?iB/s]

Your assistance on this matter is greatly appreciated.

rhpvorderman commented 9 months ago

Added the ultra-long RA as the only "top strand" and "start" to the adapter_list.tsv file:

The rapid adapter is already included. Its sequence is the same as the ligation kit adapter. There is no need to modify the adapter list file.

sequali /ONT_raw_data/OutFQ/reads.fastq --adapter-file "$ASFL" --outdir /ONT_raw_data/OutFQ/ -t 8

Using more than 2 threads does not make any difference. The program has no threading support. It only uses an extra thread for gzip decompression. If there is no gzip or bam compression, it works single threaded anyway. It is quite fast regardless.

Your assistance on this matter is greatly appreciated.

You made an error in the format for the adapter list file. It is a four column TSV.

Since you adapted the adapter list file that was included with the installation. The best way forward is to uninstall sequali, install it again and then run the program with the default settings. If it detects a lot of ligation kit adapter, you will know the rapid adapter is still in there.

OZTaekOppa commented 9 months ago

Thank you for your reply. It worked well. However, I have a few more questions.

Q1. Adapter sequence and adapter_list.tsv: Based on the ONT team, they confirmed that the Rapid Adapter (RA) is as follows: 5’-TTTTTTTTCCTGTACTTCGTTCAGTTACGTATTGCT-3’ You mentioned that the RA sequence is already included as the ligation kit adapter in the default adapter_list.tsv. However, I could not find any RA sequences that match in the default adapter_list.tsv. Could you please enlighten me a little bit further?

Q2. Interpretation: The outcome looks really user-friendly. To understand my main objective (to detect the ONT RA sequence), I might have to pay attention to the "Adapter content" in the ".html" summary file. In relation to Q1, I am a bit unclear about whether the RA sequences are actually in the default adapter_list.tsv. Despite this, I executed the program in its default mode. Here is the outcome.

image

Given the circumstance (default mode run), which one (adapter content) should I pay attention to understand whether the ONT Rapid Adapter (RA) sequences still remain in my dataset? Also, it says "For nanopore the the adapter mix (AMX) and ligation kit have overlapping adapter sequences and are therefore indistinguishable. Please consult the [nanopore documentation] for more information which adapters are used by your kit." Could you please enlighten me a little bit further?

rhpvorderman commented 8 months ago

Q1. Because nanopore reads suffer from mutations at a higher rate it is almost impossbile to find perfect matches. There are two options for this:

Approximate matching is quite slow and hard to scale. Exact matching of a smaller subsequence however can use the methodology described here. Which allows for scanning multiple sequences at once. This is what sequali uses. Hence it can scan at 500 megabases per second for more than 15 adapter probes.

So that is why only 12 bp adapter stubs are used. The one used for the ligation kit is TTACGTATTGCT. Which is exactly the sequence that would have been used for the Rapid Adapter. You are correct that I should put this in a comment in adapter_list.tsv. I will put it on the todo list.

Q2: Your plot looks good. PolyA sequences are overrepresented in the human genome in general. I wonder how your FASTQ headers look. Sequali should automatically detect that these are nanopore sequences and not include PolyA and PolyG.

The rest of the hits are probably false positives. Looks like your basecaller did a good job adapter trimming and read splitting!

As for the comment about overlapping sequences: if only 12bp are taken, it is impossible to distuingish some nanopore adapters. This is mostly important if they are found and need to be trimmed with cutadapt. Cutadapt uses an approximate matching algorithm, and thus needs the full-length sequence for the best results.

OZTaekOppa commented 8 months ago

Thank you for your clarification.

Based on the current sequali results and your explanation, I am confident that there are no ONT Rapid Adapter (RA) sequences in my tested dataset, thanks to the excellent job of buttery-eel, which handles basecalling, adapter trimming, and read splitting effectively.

Regarding overlapping sequences and potential noise in ONT reads, I extended my testing to the basecalled FASTQ using cutadapt to identify and remove any residual adapters that might not have been detected in the initial buttery-eel pipeline. The combination of buttery-eel, sequali, and cutadapt ensures a thorough and reliable processing of ONT reads for end-users.

rhpvorderman commented 8 months ago

The combination of buttery-eel, sequali, and cutadapt ensures a thorough and reliable processing of ONT reads for end-users.

Thanks for the compliment :smile: . I made sequali in the hope that it was useful. Also thanks for mentioning about buttery-eel. I did not know about that program.

rhpvorderman commented 6 months ago

I made documentation about the adapter content module: https://sequali.readthedocs.io/#adapter-content-module This does explain the structure of the adapter file. Do you consider this sufficient for the issue to be closed?