ValueError: not enough values to unpack

lxxiaoxiaLi commented 2 years ago

Hi Lolita,

After running:

python3 SVJedi//svjedi.py -v $dir/Chr12.INS.CW01.seq.vcf -r nip -i reads.pass.fastq.gz -o $dir/Chr12.INS.CW01.gy -t 4 -d ont

I got this error:

Traceback (most recent call last):
  File "/public/home/Shang-team/project/lxx/software/SVJedi//svjedi.py", line 193, in <module>
    main(sys.argv[1:])
  File "/public/home/Shang-team/project/lxx/software/SVJedi//svjedi.py", line 186, in main
    genotype.genotype(paf_file, vcf_file, output_file, min_support, d_over, d_end, ladj)
  File "/public/home/Shang-team/project/lxx/software/SVJedi/modules/genotype.py", line 84, in genotype
    readId, readLength, readStart, readEnd, _, refId, refLength, refStart, refEnd, match, blockLength, quality, *_ = line.split("\t")
ValueError: not enough values to unpack (expected at least 12, got 3)

Here is a line from one of the PAF files:

f3e2543c-0f1b-45c5-be1e-42f29d21a75a 31035 8392 8644 + ref_nip.Chr12_19715631-239 10000 45 302 235 263 0 NM:i:28 ms:i:348 AS:i:348 nn:i:0 tp:A:S cm:i:11 s1:i:119 de:f:0.0856 rl:i:15 cg:Z:14M1D37M2I10M1I47M1I22M1D29M2D35M2D10M1D20M3D6M2I11M1D5M

Strangely, I only get this error on some samples from the same ONT sequencing batch.

Please help me. Thanks, Xiaoxia Li
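For readers hitting the same message, here is a minimal sketch of how this error arises. It mirrors the starred unpacking shown in the traceback, but parse_paf_line and try_parse are hypothetical helpers written for this illustration, not SVJedi functions. The complete record reuses the first 13 columns of the PAF line quoted above; the truncated record is an assumed example of a line cut off after three fields.

```python
def parse_paf_line(line):
    """Unpack the 12 mandatory PAF columns, as the traceback line does."""
    (read_id, read_len, read_start, read_end, _strand,
     ref_id, ref_len, ref_start, ref_end,
     match, block_len, quality, *_tags) = line.rstrip("\n").split("\t")
    return read_id, ref_id, int(quality)


def try_parse(line):
    """Return the parsed tuple, or the error message if unpacking fails."""
    try:
        return parse_paf_line(line)
    except ValueError as err:
        return str(err)


# A complete record: 12 mandatory columns plus one optional tag.
good = "\t".join([
    "f3e2543c-0f1b-45c5-be1e-42f29d21a75a", "31035", "8392", "8644", "+",
    "ref_nip.Chr12_19715631-239", "10000", "45", "302", "235", "263", "0",
    "NM:i:28",
])

# Assumed example of a damaged record: only the first three columns survive.
truncated = "\t".join(good.split("\t")[:3])

print(try_parse(good))
print(try_parse(truncated))
```

A line like the one quoted above has more than 12 tab-separated fields and unpacks fine, so the failure most likely comes from a different, shorter line elsewhere in the PAF file.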

llecompte commented 2 years ago

Hi Xiaoxia,

Thank you for using SVJedi. I'll try to fix this issue as soon as possible.

Could you please share with me the result of this command?

awk '( NF < 12 ){print $0}' yourfile.paf
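If awk is not convenient, the same check can be sketched in Python. This is only an illustration: find_short_paf_lines is a hypothetical helper, not part of SVJedi. Unlike awk with its default field separator (any whitespace), it splits on tabs only, which matches the line.split("\t") call in the traceback, and it also reports line numbers.

```python
def find_short_paf_lines(path, min_fields=12):
    """Return (line_number, field_count, line) for every line of the file
    at `path` that has fewer than `min_fields` tab-separated fields."""
    short = []
    with open(path) as handle:
        for number, line in enumerate(handle, start=1):
            text = line.rstrip("\n")
            fields = text.split("\t")
            if len(fields) < min_fields:
                short.append((number, len(fields), text))
    return short
```

Calling find_short_paf_lines("yourfile.paf") returns an empty list when every line has at least 12 fields. Note that empty lines are reported too, with a field count of 1.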

Best, Lolita

lxxiaoxiaLi commented 2 years ago

Hi, Lolita

I'm really sorry for my very late answer. I've taken care of it. I have another question: for rice genomes, is it OK to use the default parameters below?

-dover  Breakpoint distance overlap required (default 100 bp)
-dend  Soft-clipping length allowed to consider a semi-global alignment (default 100 bp)
-ladj  Length of sequences adjacent to each end of breakpoints (default 5,000 bp)
-d/--data  Type of sequencing data, either ont or pb (default pb)

In addition, I split the input file (-v/--vcf, set of SVs in VCF) into smaller files, one per chromosome, with insertions and deletions separated, such as Chr1.deletion.vcf, Chr1.insetion.vcf, Chr2.deletion.vcf, Chr2.insetion.vcf, and so on. Does running each file separately (python3 svjedi.py -v Chr1.deletion.vcf -a -i ) affect the accuracy of the results?

Please help me. Thanks, Xiaoxia Li

llecompte commented 2 years ago

Hi Xiaoxia,

Can you tell me what you did to fix the problem with the PAF files, please? I was not aware that PAF files could have a variable number of fields.

Yes, I recommend using the default settings, especially for dover, dend, and ladj. But you should specify the type of sequencing data: ont or pb (--data). Let me know if you have HiFi data.

Finally, splitting the VCF files will have no impact if you use the same -a reference allele file each time.

Don't hesitate to ask if you have any other questions.

Best, Lolita

clemaitre commented 2 years ago

Hi,

Regarding the idea of splitting the input VCF file, here are some additional considerations and recommendations that may be useful to others:

Splitting the VCF file leaves the results unchanged only if the reference_at_breakpoints.fasta file has already been created from the whole VCF file and is passed with the -a option. In that case, however, splitting does not reduce the running time. On the contrary, exactly the same mapping step is performed once per split file, so the running time is unnecessarily multiplied by the number of split files.
Splitting the VCF file at the beginning of the pipeline (before calling the command: python3 svjedi.py -v <set_of_sv.vcf> -r <refgenome.fasta> -i <long_reads.fastq>) may affect the quality of the results. SVJedi generates representative allele sequences for all alleles of all SVs in the input VCF and stores them in a single FASTA file, reference_at_breakpoints.fasta. Reads are then mapped against all these sequences in a single run. Depending on the content of this file, the mapping quality and filtering of a given read may vary. In particular, if a read maps to several sequences, it is considered multi-mapped and filtered out (not used for genotyping). As a consequence, the more SVs in the VCF, the more accurate the estimated genotypes (since spurious mappings due to repeats are identified as multi-mappings and filtered out), but more SVs may be "not genotyped" because of an insufficient number of supporting reads (more NA values, in particular for close or overlapping SVs).
Finally, splitting the input VCF file will not decrease the overall running time. The running time is mainly governed by the mapping step, and the mapping time is mainly governed by the number of reads to map (not by the reference FASTA file the reads are mapped to). So splitting the VCF file is likely to increase the running time.
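To make the multi-mapping point concrete, here is a toy sketch. It is not SVJedi's actual filtering code. The function uniquely_mapped_reads and the read and sequence names (r1, r2, sv1_ref, sv7_ref) are hypothetical, and the filter is simplified to "keep a read only if it aligns to exactly one sequence".

```python
from collections import defaultdict


def uniquely_mapped_reads(alignments):
    """Keep only reads that align to exactly one target sequence.

    `alignments` is an iterable of (read_id, target_id) pairs, standing in
    for the read and reference columns of a PAF file.
    """
    targets_by_read = defaultdict(set)
    for read_id, target_id in alignments:
        targets_by_read[read_id].add(target_id)
    return {
        read_id: next(iter(targets))
        for read_id, targets in targets_by_read.items()
        if len(targets) == 1
    }


# Whole VCF: read r2 comes from a repeat and aligns to two SV sequences.
whole_vcf = [("r1", "sv1_ref"), ("r2", "sv1_ref"), ("r2", "sv7_ref")]

# Split VCF containing only sv1: the second hit of r2 is no longer visible.
split_vcf = [("r1", "sv1_ref"), ("r2", "sv1_ref")]

print(uniquely_mapped_reads(whole_vcf))  # r2 is filtered out
print(uniquely_mapped_reads(split_vcf))  # r2 is kept, possibly spuriously
```

With the whole VCF, the ambiguous read r2 is discarded. With the split VCF, it looks uniquely mapped and would count as support for sv1, which is how splitting can change the genotypes.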

In summary, in most cases, splitting the input VCF file is not a good idea.

If you get too many "not genotyped" SVs due to close or overlapping SVs in the VCF, instead of splitting the VCF, consider using SVJedi-graph (an improvement of SVJedi based on a graph representation of variants): https://github.com/SandraLouise/SVJedi-graph

Best, Claire

llecompte / SVJedi

ValueError: not enough values to unpack #14