ksahlin / NGSpeciesID

Reference-free clustering and consensus forming of long-read amplicon sequencing
GNU General Public License v3.0

UnboundLocalError: local variable 'polishing_pattern' referenced before assignment #20

Open · gabyrech opened this issue 2 years ago

gabyrech commented 2 years ago

Hi there! I am trying to use NGSpeciesID with some custom parameters, but I am not sure if I am doing something wrong... here is my command:

NGSpeciesID --consensus --t 24 --fastq fastq_pass.fastq.fq --outfolder out --k 15 --w 50 --min_shared 40 --mapped_threshold 0.99 --aligned_threshold 0.80

This is what I get:

ITERATION 6
Using 1 batches.
Saved: 1742 iterations.
Iteration   NrClusters  MinDbSize   CurrReadId  ClusterSizes
10000   321 58310   7173c58f-47d2-4669-8b0b-e6f4bec1114b_runid=ac7c5aa48d1eef40ac1be6b86a0419b5de8fc6ff_read=62606_ch=99_start_time=2022-02-11T02:27:20.958109-05:00_flow_cell_id=AJE108_protocol_group_id=220210_22-gtct-009_GXB02036_001_sample_id=GT22-03433_GM12878_PacI_ExoV_SrfI_1_parent_read_id=7173c58f-47d2-4669-8b0b-e6f4bec1114b    2253,2019,1879,1204,1192,964,958,885,723,671,630,610,604,595,542,526,459,436,422,391,368,334,327,297,261,257,248,247,242,209,205,179,162,161,152,145,138,138,119,119,112,112,106,95,91,91,80,70,68,66,65,64,59,58,55,54,53,53,51,49,45,45,43,42,41,39,37,36,36,36,34,34,34,32,31,28,28,27,26,26,26,24,23,23,23,21,21,20,18,18,18,18,17,16,16,16,16,16,16,16,15,15,14,14,14,13,12,12,12,12,12,12,11,11,11,10,10,10,10,10,10,9,9,9,9,9,9,8,8,8,8,8,8,8,8,7,7,7,7,7,7,7,7,6,6,6,6,6,6,6,6,6,6,6,6,6,5,5,5,5,5,5,5,5,5,5,5,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2
Total number of reads iterated through:11761
Passed mapping criteria:2295
Passed alignment criteria in this process:60
Total calls to alignment mudule in this process:781
Time elapesd clustering last iteration single core: 253.80852365493774
Time elapsed clustering: 981.8546669483185
Nr clusters larger than 1: 1351
Nr clusters (all): 9406

STARTING TO CREATE CLUSTER CONSENSUS

Temporary workdirektory for consensus and polishing: /tmp/tmporrs7wuf
0 centers formed
0 consensus formed.
Saving spoa references to files: /pod/2/wei-lab/rechg/projects/oreo/NGSpeciesID/OOC0001-1/consensus_reference_X.fasta
Traceback (most recent call last):
  File "/miniconda/envs/NGSpeciesID/bin/NGSpeciesID", line 289, in <module>
    main(args)
  File "/miniconda/envs/NGSpeciesID/bin/NGSpeciesID", line 142, in main
    centers_polished = consensus.polish_sequences(centers_filtered, args)
  File "/miniconda/envs/NGSpeciesID/lib/python3.6/site-packages/modules/consensus.py", line 193, in polish_sequences
    for folder in glob.glob(polishing_pattern):
UnboundLocalError: local variable 'polishing_pattern' referenced before assignment
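
For reference, this is Python's standard unbound-local failure mode: a name assigned only inside a loop (or branch) that never executes. Given the "0 centers formed" above, a minimal sketch of the pattern (illustrative only, not NGSpeciesID's actual code) would be:

    import glob

    def polish_sketch(centers):
        # 'pattern' is bound only inside the loop body, so if 'centers'
        # is empty the name is never assigned...
        for center in centers:
            pattern = f"/tmp/consensus_{center}*"
        # ...and Python raises UnboundLocalError here, as in the traceback:
        for folder in glob.glob(pattern):
            print(folder)

    polish_sketch([])  # UnboundLocalError: local variable 'pattern' referenced before assignment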

I suspected that maybe I am being too strict with the alignment thresholds, so I tried relaxing them a little bit (not too much), but I keep getting the same error. I would really appreciate any clue about what is going on... Thanks! Gabriel

ksahlin commented 2 years ago

Hi @gabyrech,

Looks like you get several large clusters from the clustering step, so --mapped_threshold and --aligned_threshold are not the problem (--min_shared 40 is pretty high though, but I trust that you know what you're doing :)

It is strange to me that after the clustering you get the outcome:

0 centers formed
0 consensus formed.
  1. The --abundance_ratio (which decides the minimum cluster size from which a consensus is formed) is by default 0.1 of the total number of reads. Since the total number of reads seems to be 11,761 in your case, only clusters larger than 0.1*11,761 = 1,176 reads will be considered. You can try lowering this parameter. However, since you already have 5 clusters over this threshold, I'm not sure it solves the problem (see the sketch after this list).

  2. Also, your output seems to be missing this print statement:

        print(
            f"Forming draft consensus with abundance_cutoff >= {abundance_cutoff} "
            f"({args.abundance_ratio * 100}% of {len(read_array)} reads)"
        )

This suggests you have an older version. While there have not been any major updates lately, it would be good to use the latest version (v0.1.3) and remove the output directory for a fresh rerun.

  3. Finally, with so few reads (~10k), I would consider using only one core (--t 1), as the overhead of multiple passes in multiprocessing mode may outweigh the parallelization speedup for such a small dataset. I would say 100k reads is a rough cutoff for considering multiprocessing (just guesstimating here).
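
A minimal sketch of the cutoff arithmetic from point 1 (illustrative only; the variable names are mine, not NGSpeciesID's internals):

    total_reads = 11761          # "Total number of reads iterated through" in your log
    abundance_ratio = 0.1        # NGSpeciesID's default --abundance_ratio
    abundance_cutoff = abundance_ratio * total_reads  # 1176.1

    # Largest cluster sizes from your log:
    cluster_sizes = [2253, 2019, 1879, 1204, 1192, 964, 958]
    passing = [s for s in cluster_sizes if s >= abundance_cutoff]
    print(passing)  # [2253, 2019, 1879, 1204, 1192] -> the 5 clusters over the cutoff
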
ksahlin commented 2 years ago

Let me know how it goes, as my answer might not tackle the cause of the problem.

gabyrech commented 2 years ago

Hi @ksahlin ! Thanks for your quick and comprehensive response!

First let me explain a little bit more so maybe I can even ask you for advice :-).

About the data: These are targeted ONT sequences with A LOT of repetitive sequence (mostly simple repeats). What I want to do is obtain as many consensus sequences as possible while avoiding clustering reads that don't actually come from the same genomic region (which is very hard because they share the repetitive sequence with other genomic regions). This is why I thought that using a high --min_shared, --mapped_threshold, and --aligned_threshold would allow me to cluster ONLY those reads that are very, very similar and therefore come from the same region. Do you think I am right?

About your suggestions:

  1. Ok I will try lowering --abundance_ratio and see what happens.
  2. Yes, I am using an older version (v0.1.1.1), I will update to the latest.
  3. Great, thanks for the suggestion!

Any suggestion is very welcome! Thanks! Gabriel

ksahlin commented 2 years ago

Okay, I see! What's the rough error rate of your reads?

We do have IsoCon for a similar purpose if the reads have a relatively low error rate (say <5%). IsoCon assumes that the reads do not differ too much at the ends (i.e., they are roughly full-length over the targeted region).

We have also developed isONcorrect, which could correct the reads before clustering (e.g., with IsoCon). isONcorrect is sort of allele (SNP/indel) aware and is in general robust, but its main purpose is not to preserve SNPs (especially low-abundance mutations) but to reduce errors in the reads.

We had one analysis with targeted gene families (but pacbio IsoSeq data) where we ran isONclust first and then ran IsoCon on each cluster individually. It worked for our data because isONclust first separated reads into different gene families, then IsoCon did a more fine-tuned separation of several alleles/transcripts from each gene.
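
A rough sketch of that two-step workflow (a sketch only: the flag names are quoted from memory of the isONclust/IsoCon READMEs, and reads.fastq and the output paths are placeholders; check each tool's --help before running):

    import glob
    import subprocess

    # 1) Coarse clustering with isONclust (--ont sets ONT-appropriate
    #    parameters), then write one fastq file per cluster.
    subprocess.run(["isONclust", "--ont", "--fastq", "reads.fastq",
                    "--outfolder", "isonclust_out"], check=True)
    subprocess.run(["isONclust", "write_fastq",
                    "--clusters", "isonclust_out/final_clusters.tsv",
                    "--fastq", "reads.fastq",
                    "--outfolder", "clusters", "--N", "1"], check=True)

    # 2) Fine-grained separation of alleles/transcripts: run IsoCon
    #    on each cluster individually.
    for fq in glob.glob("clusters/*.fastq"):
        subprocess.run(["IsoCon", "pipeline", "-fl_reads", fq,
                        "-outfolder", fq + "_isocon"], check=True)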

gabyrech commented 2 years ago

Oh! That sounds even better! My data consist of ONT reads generated with the Flongle and basecalled with Guppy 5+, so the error rate should be <5%. I also have a corrected version of these reads (corrected with Canu).

I think your approach (isONclust + IsoCon) using corrected reads might also work in our case, since our data is something like sequences from different gene families, but with the added complexity that they are full of simple tandem repeats.

I will give it a try and let you know how it goes. Thanks so much for your advice! Gabriel