mcurto / SSR-GBS-pipeline

Pipeline to analyse SSR-GBS data (microsatellite loci sequenced with Illumina) based on amplicon length and sequence identity.
5 stars 1 forks source link

error using script 6 (correct_allele_sequence.py): IndexError: string index out of range #2

Open Robvh-git opened 4 years ago

Robvh-git commented 4 years ago

Hello Manuel,

I'm further experimenting with your scripts, but I receive an error which I cannot fix when running script 6: python correct_allele_sequence.py ./script5_output/Ti1_TG.fasta ./script4_output_test/ 1 ./script6_output/output_file.fasta

I receive following error:

  File "correct_allele_sequence.py", line 100, in <module>
    positions_filt = filter_Ns_records(positions, parse_fasta(fasta))
  File "correct_allele_sequence.py", line 44, in filter_Ns_records
    nuc.append(seq[i].upper())
IndexError: string index out of range

parameter 1 ./script5_output/Ti1_TG.fasta: Ti1_TG.fasta is a fasta file containing the 2 consensus sequences from this marker (i.e. a heterozygotous marker). I obtained this by merging the 2 script 5 output files of this marker by executing cat Ti1_TG_Sample1_Al_399_C_25_70.fasta Ti1_TG_Sample1_Al_401_C_77_70.fasta > Ti1_TG.fasta. Is this the correct way of obtaining this file? So Ti1_TG.fasta looks like:

>Ti1_TG_Sample1_Al_399_C_25_70
SEQUENCE
>Ti1_TG_Sample1_Al_401_C_77_70
SEQUENCE

This is just from 1 marker as a test, but I have these files for all markers from 1 sample you used in the Nile Tilapia research. In the file from this marker, no Ns are present. In the description you state for the first parameter of script 6: fasta file containing all consensus sequences from one marker., but could you clarify this? So per sample and marker, there can be 1 (if homozygote) or 2 (if heterozygote) consensus sequences in this file? Or do I misinterpret this?

Parameter 2 ./script4_output_test/: this is the script 4 output of marker Ti1_TG . This folder contains 2 files Ti1_TG_Sample1_Al_399.fasta and Ti1_TG_Sample1_Al_401.fasta, so the output from script 4 of this marker and sample.

parameter 3: 1: I use this number just as test

parameter 4: ./script6_output/output_file.fasta: the directory and file name where the output of this script should be saved.

Is this the correct input? I'm wondering if the reason for the error happend 'upstream' in scripts 3, 4 or 5 as I noticed some errors there.

In script 4, the description of the script seems outdated. Upon inspecting the script, I noticed that there are actually 4 arguments required instead of 3 as described. Also, I had to manually edit the .csv output file from script 3 before being able to use it in script 4: I had to perform following adjustments: 1)change too little reads to a 0. 2) delete _mancheck if present 3) in the first row (i.e. the header) I had to delete _1 and _2_ from the marker information. If I did not delete this, the script will search for .fastq files with _1 and _2 in it, where are not present in the output from script 1.

I had to manually edit script 5 as the output names where not correct. I did 3 adjustments which are marked in the attached file with comments #ROB EDIT: get_consensus_and_freq_edit.txt

I hope I described the problem and the origin of the files clear enough. Could you please advice me on how to solve this error?

Kind regards, Robbert

mcurto commented 4 years ago

Hi Robbert, thanks again for trying the pipeline and finding the mistakes in the scripts. I went through the scripts you mentioned and they are indeed outdated. Now I have have a script that runs the pipeline starting from step 4 until the end, which will overcome the formatting issues. Also, for this version of the script, it will not be necessary to edit the .csv matrix. I will upload it and update the readme information still this week. I am just doing some final tests. Best regards, Manuel

Robvh-git commented 4 years ago

Hi manual,

again, thank you for the quick reply. I am looking forward to the new script and the updated readme.

Thanks!

Robbert

mcurto commented 4 years ago

Hi Robber, I updated the pipeline. Please check it out and let me know if you have any questions or if you find any problem. Best, Manuel

Robvh-git commented 4 years ago

Hi Manuel,

thank you for updating the pipeline, the description looks much clearer now. Very nice that there is 1 script ( Sequence_Allele_Call.py) that replaces 4 others.

Best, Robbert