ksahlin / IsoCon

Derives consensus sequences from a set of long noisy reads by clustering and error correction.
GNU General Public License v3.0

Error: AttributeError: 'DiGraph' object has no attribute 'node' #7

Open nextgenusfs opened 4 years ago

nextgenusfs commented 4 years ago

Hi @ksahlin. Thanks a ton for all of your tools for noisy reads. I'm looking for a solution for de novo clustering of ONT amplicon reads from environmental sequencing, i.e., fungal rRNA amplicons. The data I'm trying this on is from a mock community of mixed species. The region is the ITS-LSU region of rRNA in fungi -- we typically define species with a 97% pident cutoff in this region. The data has been pre-processed by re-orienting reads into the same direction and finding/trimming both forward and reverse primer sequences.

I've tried isONclust and at first it seemed like it might be working great (and quite fast), but on further inspection it was a little too liberal in clustering the data that I have access to at the moment, effectively combining too many reads into the same "gene family". I ran a parameter search by varying k and w to see if I could get it to give me the proper results, but never found a set of parameters that could delineate the clusters properly. My goal is to find a method to identify the "centroid", as it is then relatively straightforward to use spoa and racon/medaka for error correction. I tried to clean up the clustering a little bit with a "sub clustering" step: plotting read lengths (as fungal ITS-LSU sequences are variable in length) and then pulling out "peaks" from the read-length distribution. This seemed to work okay, but it is still not quite what I'm looking for.
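The length-based "sub clustering" idea can be sketched roughly as below. This is only an illustration of binning read lengths and picking local maxima, not the exact procedure used here; the function name `length_peaks` and the `bin_width` and `min_count` values are assumptions.

```python
from collections import Counter


def length_peaks(lengths, bin_width=25, min_count=2):
    """Return (bin_start, count) for local maxima in a read-length histogram."""
    # Bin every read length down to the start of its bin.
    bins = Counter((n // bin_width) * bin_width for n in lengths)
    peaks = []
    for start, count in sorted(bins.items()):
        left = bins.get(start - bin_width, 0)
        right = bins.get(start + bin_width, 0)
        # A peak is at least as tall as its left neighbour and strictly
        # taller than its right one, so a flat plateau yields one peak.
        if count >= min_count and count >= left and count > right:
            peaks.append((start, count))
    return peaks


read_lengths = [810, 812, 815, 1200, 1205, 1210, 1211, 2000]
print(length_peaks(read_lengths))
```

Reads falling near each reported bin could then be pulled out and clustered separately.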

Based on some of the other issues in your tool repositories, I then tried IsoCon, which you had indicated seemed to be a more general approach. IsoCon has a much, much longer runtime and eventually crashed with the error below (note that I initially ran it without --prefilter_candidates --min_candidate_support 2 and it crashed with the same error).

If you have any other suggestions on an appropriate workflow I'd be grateful to hear your opinions.

Thanks, Jon

$ IsoCon pipeline -fl_reads reads.oriented.proper-primers.yacrd.fastq -outfolder isocon_test2 --verbose --prefilter_candidates --min_candidate_support 8 --nr_cores 7
fl_reads: reads.oriented.proper-primers.yacrd.fastq
outfolder: isocon_test2
ccs: None
nr_cores: 7
verbose: True
neighbor_search_depth: 4294967296
min_exon_diff: 20
min_candidate_support: 8
p_value_threshold: 0.01
min_test_ratio: 5
max_phred_q_trusted: 43
ignore_ends_len: 15
cleanup: False
prefilter_candidates: True
which: pipeline
is_fastq: True

ITERATION: 1

Max transcript length:2694, Min transcript length:806
Non-converged (unique) sequences left: 67501
[0, 964, 1928, 2892, 3856, 4820, 5784, 6748, 7712, 8676, 9640, 10604, 11568, 12532, 13496, 14460, 15424, 16388, 17352, 18316, 19280, 20244, 21208, 22172, 23136, 24100, 25064, 26028, 26992, 27956, 28920, 29884, 30848, 31812, 32776, 33740, 34704, 35668, 36632, 37596, 38560, 39524, 40488, 41452, 42416, 43380, 44344, 45308, 46272, 47236, 48200, 49164, 50128, 51092, 52056, 53020, 53984, 54948, 55912, 56876, 57840, 58804, 59768, 60732, 61696, 62660, 63624, 64588, 65552, 66516, 67480]
query chunks: [964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 21]
processing  0
processing  14500
processing  3000
processing  17500
processing  6000
processing  500
processing  9000
processing  12000
processing  15000
processing  3500
processing  18000
processing  6500
processing  1000
processing  9500
processing  12500
processing  15500
processing  4000
processing  7000
processing  18500
processing  10000
processing  1500
processing  13000
processing  4500
processing  16000
processing  7500
processing  10500
processing  2000
processing  19000
processing  13500
processing  5000
processing  16500
processing  8000
processing  11000
processing  2500
processing  19500
processing  14000
processing  5500
processing  17000
processing  8500
processing  11500
processing  29000
processing  20000
processing  20500
processing  32000
processing  23500
processing  35000
processing  26500
processing  21000
processing  29500
processing  32500
processing  38000
processing  24000
processing  35500
processing  27000
processing  21500
processing  30000
processing  33000
processing  24500
processing  38500
processing  27500
processing  36000
processing  22000
processing  30500
processing  33500
processing  25000
processing  39000
processing  28000
processing  36500
processing  22500
processing  31000
processing  25500
processing  34000
processing  28500
processing  39500
processing  37000
processing  23000
processing  31500
processing  40500
processing  26000
processing  34500
processing  43500
processing  40000
processing  46500
processing  37500
processing  41000
processing  55000
processing  49500
processing  52500
processing  44000
processing  47000
processing  55500
processing  58000
processing  50000
processing  41500
processing  53000
processing  44500
processing  58500
processing  56000
processing  47500
processing  50500
processing  42000
processing  53500
processing  59000
processing  56500
processing  45000
processing  48000
processing  51000
processing  54000
processing  42500
processing  59500
processing  57000
processing  45500
processing  51500
processing  48500
processing  54500
processing  60000
processing  57500
processing  43000
processing  60500
processing  52000
processing  46000
processing  49000
processing  67000
processing  67500
processing  61000
processing  64000
processing  61500
processing  64500
processing  62000
processing  65000
processing  65500
processing  62500
processing  66000
processing  63000
processing  66500
processing  63500
isolated: 0
Number of edges: 76499
Total edit distance: 14654968
Avg ed (ed/edges): 191.57071334265808
Traceback (most recent call last):
  File "/Users/jon/miniconda3/envs/amptk_dev/bin/IsoCon", line 292, in <module>
    run_pipeline(params)
  File "/Users/jon/miniconda3/envs/amptk_dev/bin/IsoCon", line 159, in run_pipeline
    candidate_file, read_partition, to_realign = isocon_get_candidates.find_candidate_transcripts(params.read_file, params)
  File "/Users/jon/miniconda3/envs/amptk_dev/lib/python3.6/site-packages/modules/isocon_get_candidates.py", line 129, in find_candidate_transcripts
    G_star, graph_partition, M, converged = partitions.partition_strings(S, params)
  File "/Users/jon/miniconda3/envs/amptk_dev/lib/python3.6/site-packages/modules/partitions.py", line 420, in partition_strings
    G_star, converged = graphs.construct_exact_nearest_neighbor_graph(S, params)
  File "/Users/jon/miniconda3/envs/amptk_dev/lib/python3.6/site-packages/modules/graphs.py", line 63, in construct_exact_nearest_neighbor_graph
    if G.node[s1]["degree"] > 1:
AttributeError: 'DiGraph' object has no attribute 'node'

ksahlin commented 4 years ago

Hi @nextgenusfs, thank you, I appreciate it!

Regarding the runtime bug: I actually fixed it today (see #6). You can either reinstall IsoCon (version 0.3.3) by removing it and installing from scratch, or simply downgrade networkx to 2.3 as follows:

pip uninstall networkx
pip install networkx==2.3 
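For context, the traceback comes from `G.node[...]`, an accessor that networkx removed in release 2.4 in favor of `G.nodes[...]`, which is why pinning 2.3 restores the old behavior. A minimal sketch of the access pattern that works across networkx 2.x follows. The helper name `node_degree_attr` and the stand-in class are hypothetical and exist only so the snippet runs without networkx installed; this is not necessarily how the fix in IsoCon is written.

```python
def node_degree_attr(graph, node):
    """Read the "degree" node attribute via `nodes`, not the removed `node`."""
    # `graph.nodes[node]` returns the attribute dict throughout networkx 2.x.
    return graph.nodes[node]["degree"]


class FakeDiGraph:
    """Hypothetical stand-in mimicking networkx >= 2.4: has `nodes`, no `node`."""

    def __init__(self):
        self.nodes = {}


G = FakeDiGraph()
G.nodes["ACGT"] = {"degree": 2}

# The old spelling fails exactly as in the traceback above.
try:
    G.node["ACGT"]
except AttributeError as err:
    print("old accessor:", err)

print("new accessor:", node_degree_attr(G, "ACGT") > 1)
```

With a real graph, `networkx.DiGraph()` followed by `G.add_node("ACGT", degree=2)` can be passed to the same helper.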

As for the strategy I will get back tomorrow (it's pretty late in my timezone). Here are some general comments on IsoCon for ONT sequencing:

  1. IsoCon does not handle reverse complements. If you have reverse complemented sequences the predictions will contain the sequence and its reverse complement, but maybe that’s easy to post-filter? Another strategy is to identify the primers beforehand and re-orient the reverse complements (to speed up runtime of IsoCon even more).
  2. There will be some redundant consensus sequences due to the different ONT error profile compared to IsoSeq data. Therefore, the parameters to specify would be --max_phred_q_trusted 20 (default is 43, for higher quality CCS reads) and --p_value_threshold 0.00001 (instead of the default 0.01). This could, however, also be post-filtered by simply removing consensus sequences with a p-value larger than, e.g., 0.00001 (the p-value is printed to the accession of the consensus sequence).
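The reverse-complement post-filter from point 1 could look roughly like the sketch below. It assumes the consensus sequences have already been loaded as `(accession, sequence)` pairs and that a sequence and its reverse complement match exactly, which may not hold for noisy consensus; the function names are hypothetical and not part of IsoCon.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def drop_reverse_complements(records):
    """Keep the first record of each sequence / reverse-complement pair.

    `records` is an iterable of (accession, sequence) tuples. Exact
    duplicates are dropped as well.
    """
    seen = set()
    kept = []
    for accession, seq in records:
        key = seq.upper()
        if key in seen or reverse_complement(key) in seen:
            continue  # already represented in one orientation
        seen.add(key)
        kept.append((accession, seq))
    return kept


consensus = [("c1", "AACG"), ("c2", "CGTT"), ("c3", "AAAA")]
print(drop_reverse_complements(consensus))
```

Here `c2` is the reverse complement of `c1`, so only `c1` and `c3` are kept. For near-identical rather than exact matches, an alignment-based comparison would be needed instead.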

nextgenusfs commented 4 years ago

Great, thanks. I downgraded networkx and I'll retry right now with your suggested ONT parameters.

The data I'm using is published, but I've already oriented and trimmed primers so I'm only trying to feed IsoCon the "cleaned up" data in hopes of being able to pick cluster centroids.

ksahlin commented 4 years ago

Also, regarding isONclust: you could increase the cluster thresholds, e.g., --mapped_threshold 0.9 --aligned_threshold 0.7 (and perhaps -k 12 -w 15 if runtime allows it). This will be more stringent (more clusters).

Also note that isONclust has an experimental --consensus parameter that performs what you said: spoa then medaka on each cluster. It may be convenient.