schmeing / gapless

Gapless provides combined scaffolding, gap-closing and assembly correction with long reads
MIT License

large genome changes #4

Open rwhetten opened 1 year ago

rwhetten commented 1 year ago

Thanks for a very interesting tool! I'm working with a very large (>20 Gb) draft genome assembly, so the default index size of 4 Gb for minimap2 requires 6 passes to complete alignments of reads to the genome. Would it create any problem if I edit the gapless.sh script to add the option -I 24G to the minimap2 command lines so it can complete the alignments in a single pass?

schmeing commented 1 year ago

The Python code uses np.int32 to read in the files from minimap2, so for draft genomes this large you need to replace those with np.int64 or int. Then I hope it works. Please let me know if it does not.
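
For illustration, a minimal sketch of that suggested change (not a patch from this thread), widening the integer PAF columns in the pandas read from np.int32 to np.int64; the file name is hypothetical:

import numpy as np
import pandas as pd

# Column order of minimap2's 12 mandatory PAF fields; np.int32 caps each
# value at 2**31 - 1, so np.int64 is needed for coordinates past ~2.1 Gbp.
names = ['q_name', 'q_len', 'q_start', 'q_end', 'strand', 't_name',
         't_len', 't_start', 't_end', 'matches', 'alignment_length', 'mapq']
int_cols = ['q_len', 'q_start', 'q_end', 't_len', 't_start', 't_end',
            'matches', 'alignment_length']
dtypes = {'q_name': object, 'strand': str, 't_name': object,
          'mapq': np.int16, **{c: np.int64 for c in int_cols}}  # was np.int32
mappings = pd.read_csv('gapless_reads.paf', sep='\t', header=None,
                       usecols=range(12), names=names, dtype=dtypes)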

rwhetten commented 1 year ago

I am by no means proficient in Python, so please correct me if I misunderstood the code. A search with grep -n "np.int32" gapless.py returned three lines of code:

217: return pd.read_csv(file_name, sep='\t', header=None, usecols=range(12), names=['q_name','q_len','q_start','q_end','strand','t_name','t_len','t_start','t_end','matches','alignment_length','mapq'], dtype={'q_name':object, 'q_len':np.int32, 'q_start':np.int32, 'q_end':np.int32, 'strand':str, 't_name':object, 't_len':np.int32, 't_start':np.int32, 't_end':np.int32, 'matches':np.int32, 'alignment_length':np.int32, 'mapq':np.int16})

9168: for reads in pd.read_csv(all_vs_all_mapping_file, sep='\t', header=None, usecols=range(9), names=['q_name','q_len','q_start','q_end','strand','t_name','t_len','t_start','t_end'], dtype={'q_name':object, 'q_len':np.int32, 'q_start':np.int32, 'q_end':np.int32, 'strand':str, 't_name':object, 't_len':np.int32, 't_start':np.int32, 't_end':np.int32}, chunksize=read_in_chunks):

9563: for reads in pd.read_csv(all_vs_all_mapping_file, sep='\t', header=None, usecols=range(9), names=['q_name','q_len','q_start','q_end','strand','t_name','t_len','t_start','t_end'], dtype={'q_name':object, 'q_len':np.int32, 'q_start':np.int32, 'q_end':np.int32, 'strand':str, 't_name':object, 't_len':np.int32, 't_start':np.int32, 't_end':np.int32}, chunksize=read_in_chunks):

The two PAF files produced by minimap2 before running the gapless.py script are the reads aligned to the draft assembly and the reads aligned to each other. As long as none of the reads or scaffolds is long enough to exceed the limit of the np.int32 type (~2 Gbp, as I understand the documentation), it seems to me there should not be a problem. In principle, if the program succeeds in merging contigs and scaffolds into complete chromosomes, some scaffolds could exceed 2 Gbp in the final passes of the job. At this point I am still waiting on the minimap2 step, so it is too soon to tell. Building scaffolds >2 Gbp is a problem I will be delighted to encounter if it happens - thanks for the pointer to how to fix it.
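
That ~2 Gbp ceiling can be checked directly (a quick illustration, not from the thread):

import numpy as np

# np.int32 overflows beyond 2**31 - 1, i.e. ~2.1 Gbp for base
# coordinates; np.int64 raises the ceiling to ~9.2e18.
print(np.iinfo(np.int32).max)  # 2147483647
print(np.iinfo(np.int64).max)  # 9223372036854775807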

rwhetten commented 1 year ago

I used sed 's/np.int32/np.int64/g' gapless.py > gapless64.py to change all occurrences of np.int32 in gapless.py to np.int64, and modified gapless.sh to call gapless64.py. I also modified gapless.sh to create and save an index of the draft genome and of the long reads using -I 24G so minimap2 can map reads in a single pass rather than 6 passes. The mapping phase completed successfully and created a PAF file, but the program died in the scaffold phase. I should note that I created a virtual environment called gapless_env and installed the dependencies of the gapless program in that environment so they would be separate from the system tools, and activated the environment before invoking the gapless shell script. The gapless_scaffold.log file contains the following messages.

0:00:06.709484 Reading in original assembly
0:01:36.569799 Loading repeats
2:05:21.963119 Filtering mappings
Traceback (most recent call last):
  File "/media/hd03/software/gapless_env/lib/python3.11/site-packages/pandas/core/indexes/base.py", line 3803, in get_loc
    return self._engine.get_loc(casted_key)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pandas/_libs/index.pyx", line 138, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 165, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 5745, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 5753, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'y'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/media/hd03/software/gapless/gapless64.py", line 13327, in <module>
    main(sys.argv[1:])
  File "/media/hd03/software/gapless/gapless64.py", line 13156, in main
    GaplessScaffold(args[0], args[1], args[2], min_mapq, min_mapping_length, min_length_contig_break, prefix, stats)
  File "/media/hd03/software/gapless/gapless64.py", line 9069, in GaplessScaffold
    mappings, cov_counts, cov_probs, read_names, read_len = ReadMappings(mapping_file, contig_ids, min_mapq, min_mapping_length, keep_all_subreads, alignment_precision, num_read_len_groups, pdf)
                                                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/media/hd03/software/gapless/gapless64.py", line 427, in ReadMappings
    cov_probs = GetCoverageProbabilities(cov_counts, pdf)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/media/hd03/software/gapless/gapless64.py", line 386, in GetCoverageProbabilities
    PlotXY(pdf, "Coverage", "% Bins (Size: {})".format(bsize), probs.loc[selection2, 'count'], probs.loc[selection2, 'nbins']/probs.loc[selection2, 'nbin_sum']*100, linex=x_values, liney=nbinom.pmf(x_values,opt_par[0],opt_par[1])*100)
  File "/media/hd03/software/gapless/gapless64.py", line 285, in PlotXY
    sns.lineplot(x=linex, y=liney, color='red', linewidth=2.5)
  File "/media/hd03/software/gapless_env/lib/python3.11/site-packages/seaborn/relational.py", line 645, in lineplot
    p.plot(ax, kwargs)
  File "/media/hd03/software/gapless_env/lib/python3.11/site-packages/seaborn/relational.py", line 459, in plot
    lines = ax.plot(sub_data["x"], sub_data["y"], **kws)
                                   ~~~~~~~~^^^^^
  File "/media/hd03/software/gapless_env/lib/python3.11/site-packages/pandas/core/frame.py", line 3805, in __getitem__
    indexer = self.columns.get_loc(key)
              ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/media/hd03/software/gapless_env/lib/python3.11/site-packages/pandas/core/indexes/base.py", line 3805, in get_loc
    raise KeyError(key) from err
KeyError: 'y'

schmeing commented 1 year ago

The first thing would be to check if it is purely an error during plotting. You can remove the plotting by removing -s pass${i}/gapless_stats.pdf from the gapless.py scaffold command.

The next step would be to check what goes wrong: in GetCoverageProbabilities(cov_counts, pdf) we pass two parameters, linex=x_values and liney=nbinom.pmf(x_values,opt_par[0],opt_par[1])*100, to the plotting function, and they seem to have an issue. It would be good to see them. My guess is that the fit has a problem, i.e. the two opt_par values from this call: minimize(CovChi2, [mean, 1/2], args=(pd.DataFrame({'x':probs.loc[selection, 'count'], 'y':probs.loc[selection, 'nbin_cumsum']/probs.loc[selection, 'nbin_sum']})), method='Nelder-Mead').x

rwhetten commented 1 year ago

I added a new line after line 380 of gapless.py (which starts with opt_par = minimize(CovChi2, [mean, 1/2]...)) to print the values of the opt_par list, i.e. print("The opt_par values are",opt_par), and ran the gapless.py scaffold command without the -s pass${i}/gapless_stats.pdf option. The gapless_scaffold.log output file contains the following.

0:00:08.828392 Reading in original assembly
0:01:19.492932 Loading repeats
1:16:33.189363 Filtering mappings
The opt_par values are [0.89590919 0.06012658]
The opt_par values are [0.75814105 0.05858334]
The opt_par values are [0.59402949 0.04910503]
The opt_par values are [0.46944524 0.04454478]
The opt_par values are [0.33122449 0.03410489]
The opt_par values are [ 0.13681261 -0.02344248]
The opt_par values are [1.36575799 0.67843674]
The opt_par values are [0.78013647 0.68114021]
The opt_par values are [0.30425958 0.68656148]
1:38:51.930177 Search for possible break points
Traceback (most recent call last):
  File "/home/FCAM/rwhetten/tools/gapless/gapless64.py", line 13328, in <module>
    main(sys.argv[1:])
  File "/home/FCAM/rwhetten/tools/gapless/gapless64.py", line 13157, in main
    GaplessScaffold(args[0], args[1], args[2], min_mapq, min_mapping_length, min_length_contig_break, prefix, stats)
  File "/home/FCAM/rwhetten/tools/gapless/gapless64.py", line 9079, in GaplessScaffold
    break_groups, repeat_removal, spurious_break_indexes, non_informative_mappings, unconnected_breaks = FindBreakPoints(mappings, contigs, covered_regions, repeats, max_dist_contig_end, min_mapping_length, min_length_contig_break, max_break_point_distance, min_num_reads, min_extension, merge_block_length, org_scaffold_trust, cov_probs, prob_factor, allow_same_contig_breaks, prematurity_threshold, pdf)
  File "/home/FCAM/rwhetten/tools/gapless/gapless64.py", line 956, in FindBreakPoints
    repeat_removal = IdentifyBreakinsRepeats(break_points, mappings, repeats, min_num_reads, max_dist_contig_end, min_mapping_length, min_length_contig_break, merge_block_length)
  File "/home/FCAM/rwhetten/tools/gapless/gapless64.py", line 703, in IdentifyBreakinsRepeats
    bmap = inner_repeats.merge(bmap, on=['contig_id','block'], how='inner')
  File "/home/FCAM/rwhetten/tools/gapless_env/lib/python3.9/site-packages/pandas/core/frame.py", line 10080, in merge
    return merge(
  File "/home/FCAM/rwhetten/tools/gapless_env/lib/python3.9/site-packages/pandas/core/reshape/merge.py", line 124, in merge
    return op.get_result(copy=copy)
  File "/home/FCAM/rwhetten/tools/gapless_env/lib/python3.9/site-packages/pandas/core/reshape/merge.py", line 775, in get_result
    result = self._reindex_and_concat(
  File "/home/FCAM/rwhetten/tools/gapless_env/lib/python3.9/site-packages/pandas/core/reshape/merge.py", line 737, in _reindex_and_concat
    lmgr = left._mgr.reindex_indexer(
  File "/home/FCAM/rwhetten/tools/gapless_env/lib/python3.9/site-packages/pandas/core/internals/managers.py", line 742, in reindex_indexer
    new_blocks = [
  File "/home/FCAM/rwhetten/tools/gapless_env/lib/python3.9/site-packages/pandas/core/internals/managers.py", line 743, in <listcomp>
    blk.take_nd(
  File "/home/FCAM/rwhetten/tools/gapless_env/lib/python3.9/site-packages/pandas/core/internals/blocks.py", line 878, in take_nd
    new_values = algos.take_nd(
  File "/home/FCAM/rwhetten/tools/gapless_env/lib/python3.9/site-packages/pandas/core/array_algos/take.py", line 117, in take_nd
    return _take_nd_ndarray(arr, indexer, axis, fill_value, allow_fill)
  File "/home/FCAM/rwhetten/tools/gapless_env/lib/python3.9/site-packages/pandas/core/array_algos/take.py", line 158, in _take_nd_ndarray
    out = np.empty(out_shape, dtype=dtype)
numpy.core._exceptions.MemoryError: Unable to allocate 349. GiB for an array with shape (3, 15627947335) and data type int64

The job was running on a node with 500 Gb RAM allocated, and the SLURM seff command reports that the maximum utilized was 470 Gb. I don't know if the reported failure to allocate 349 GiB for an array means it reached the limit of available RAM, or if it failed to allocate memory for other reasons, but I can try it again with more RAM allocated.
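
The request itself is consistent with the shape in the traceback: an int64 array of shape (3, 15627947335) alone needs about 349 GiB, so on top of the ~470 GB already in use it could not fit in 500 GB. A quick back-of-the-envelope check (illustrative only):

# 3 * 15627947335 elements * 8 bytes (int64) / 2**30 bytes per GiB
print(3 * 15_627_947_335 * 8 / 2**30)  # ~349.3 GiB, matching the error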

rwhetten commented 1 year ago

I tried another approach, inserting a print command after line 385 in gapless.py to report values of the linex and liney variables, i.e. print("The values of linex and liney are",x_values,nbinom.pmf(x_values,opt_par[0],opt_par[1])*100). The output was:

0:00:07.265559 Reading in original assembly
0:00:37.373331 Loading repeats
0:29:00.976459 Filtering mappings
The values of linex and liney are [0 1 2 3 4 5 6 7 8 9] [8.05664966 6.78403121 6.0442823  5.48375181 5.01991069 4.61985885
 4.26675404 3.95057643 3.66473008 3.40454587]
The values of linex and liney are [0 1 2 3 4 5 6 7] [11.63587376  8.30483273  6.87284399  5.94858376  5.26148751  4.71365432
  4.25863733  3.87063079]
The values of linex and liney are [0 1 2 3 4 5] [16.69130231  9.42824321  7.14545163  5.87510846  5.01961126  4.38557366]
The values of linex and liney are [0 1 2 3 4] [23.21047743 10.41068617  7.30824513  5.74779925  4.76334433]
The values of linex and liney are [0 1 2 3] [32.66126121 10.4492558   6.71794789  5.04230881]
The values of linex and liney are [0 1] [nan nan]
Traceback (most recent call last):
  File "/home/FCAM/rwhetten/tools/gapless_env/lib/python3.9/site-packages/pandas/core/indexes/base.py", line 3803, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas/_libs/index.pyx", line 138, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 165, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 5745, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 5753, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'y'

followed by several additional Python errors.
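
The [nan nan] pair above is what scipy returns for invalid negative-binomial parameters; a quick check (illustrative, reusing the negative opt_par values from the earlier log):

import numpy as np
from scipy.stats import nbinom

# nbinom.pmf returns nan when n <= 0 or p lies outside (0, 1], so the
# negative second parameter from the unbounded fit yields [nan nan],
# which then breaks the seaborn line plot downstream.
print(nbinom.pmf(np.array([0, 1]), 0.13681261, -0.02344248) * 100)  # [nan nan]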

schmeing commented 1 year ago

Thanks for the detailed information. The negative number should not be there as a result of the fit; the program cannot successfully run through like this, with or without plotting. I will fix this bug as soon as possible, but currently the VPN is down and I cannot access the servers to test anything.

Regarding the crash, it tried to allocate an additional 349 GiB, so if it had already used 470 GB, you would need roughly 820 GB. I assume your genome is not just large, but also very repetitive. I will also try to reduce the memory usage at this step.

schmeing commented 1 year ago

I am still facing issues with the VPN and cannot run anything, but thinking about it some more, I realized that it may not be the fit that goes wrong; the input data might already be corrupted. To rule out this possibility, it would be good if you could provide me with the mean and pd.DataFrame({'x':probs.loc[selection, 'count'], 'y':probs.loc[selection, 'nbin_cumsum']/probs.loc[selection, 'nbin_sum']}) for the fit that resulted in the negative number. Thank you.

rwhetten commented 1 year ago

I added line 381 to gapless.py (the 64-bit version): print("The opt_par values are",opt_par,"; mean value is",mean,"; pd.DataFrame values are ",pd.DataFrame({'x':probs.loc[selection, 'count'], 'y':probs.loc[selection, 'nbin_cumsum']/probs.loc[selection, 'nbin_sum']})). The output was:

The opt_par values are [-0.09721827 -0.00902136] ; mean value is 22.299499514677134 ; 
pd.DataFrame values are
     x         y
0    0  0.057560
1    1  0.080566
2    2  0.114372
3    3  0.155254
4    4  0.202092
5    5  0.249341
6    6  0.298561
7    7  0.345313
8    8  0.391173
9    9  0.434149
10  10  0.474733
11  11  0.512951

schmeing commented 1 year ago

Thank you! The numbers in the dataframe look good. Thus, I introduced bounds to the optimization, which should fix the issue. It requires SciPy 1.7.0 now, so please check. I also added a --largeGenome parameter to gapless.py scaffold to replace your manual changes to it. However, I did not add anything to gapless.sh yet, so there you need to keep your manual version. Next, I will have a look at what I can do to reduce the memory footprint for many repeats.
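
For reference, a minimal sketch of a bounded Nelder-Mead fit of this kind (cov_chi2 is a stand-in objective, not the actual gapless code; passing bounds= with method='Nelder-Mead' is the SciPy 1.7.0 feature in question):

import numpy as np
from scipy.optimize import minimize
from scipy.stats import nbinom

def cov_chi2(par, data):
    # Stand-in objective: squared error between the fitted negative-
    # binomial CDF and the observed cumulative bin fractions.
    n, p = par
    return np.sum((nbinom.cdf(data['x'], n, p) - data['y'])**2)

# Toy data shaped like the 'x'/'y' dataframe printed above.
data = {'x': np.arange(12), 'y': np.linspace(0.06, 0.51, 12)}
opt_par = minimize(cov_chi2, [22.3, 0.5], args=(data,),
                   method='Nelder-Mead',
                   bounds=[(1e-9, None), (1e-9, 1.0)]).x
print(opt_par)  # both parameters now stay within their valid ranges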

rwhetten commented 1 year ago

My gapless conda env (which just provides the Python dependencies, not the gapless.py or gapless.sh code) has scipy 1.9.3 - is the requirement scipy = 1.7.0, or scipy >= 1.7.0? Would it help reduce the memory footprint if I could provide the program with a list of contigs likely to be in the same chromosome? I'm working on a strategy to use genetic linkage information to group a subset of contigs (probably in the range of 10% to 20%) into linkage groups. The larger contigs are more likely to be included, so the proportion of the total assembly partitioned into these putative chromosomal groups might be larger, perhaps 30% or more.

schmeing commented 1 year ago

Sorry for being unclear: scipy >= 1.7.0 is required, so it should work for you.

Telling the program which contigs are allowed to be connected is currently not possible and would require quite a few changes at multiple stages of the program. However, what you can do is separate the contigs into groups, create one "genome" per group, and run the repeat mapping with minimap2 on each group separately. Then split the read mapping to the whole genome by group as well and run gapless.py scaffold for each group individually. This should reduce the number of repeats and thereby the memory requirements.

Nevertheless, I will try to make the join more efficient by splitting it into multiple smaller joins, but I am still fighting to get my VPN access to work again.

rwhetten commented 1 year ago

I understand the virtues of splitting the giant genome into 10 or more smaller "subgenomes", but I want to be sure I understand how to implement that with Gapless. For the repeat mapping, each set of contigs in a subgenome is mapped against itself to generate a separate gapless_repeats.paf file. For the read mapping, however, is it also essential that only the reads that make up the contigs in each subgenome are aligned to that subgenome? Or would it be OK to align the full set of reads to each subgenome? I don't have a list of the reads that were assembled into each contig, so if the goal is to split the reads into subsets that correspond to the subgenomes, I will have to develop a strategy to do that from the gapless_reads.paf file produced by aligning all reads to all contigs. A follow-up question: assuming the subgenomes can be processed to yield a few large fragments, can those large fragments then be combined with additional contigs in an iterative process? I could repeat-mask the larger scaffolds using a repeat library built from the fragmented version of the assembly, if that would reduce the memory required to deal with all the repeat copies.

schmeing commented 1 year ago

With the latest commit the memory requirement of the step that crashed for you should be considerably reduced.

To split the job into subgenomes, I would recommend aligning the full set of reads to the full genome and then splitting the PAF by the scaffolds that went into each subgenome, giving one alignment file per subgenome. That way, the reads will still map to the best place in the whole genome.
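
A hedged sketch of that splitting step (split_paf_by_group and group_of are hypothetical names; in the PAF format, column 6 holds the target, i.e. scaffold, name):

def split_paf_by_group(paf_path, group_of, prefix):
    """Partition a whole-genome PAF into one file per subgenome."""
    # group_of maps scaffold name -> subgenome label (from the user's
    # own grouping); alignments to unassigned scaffolds are skipped.
    handles = {}
    with open(paf_path) as paf:
        for line in paf:
            scaffold = line.split('\t')[5]  # PAF column 6: target name
            group = group_of.get(scaffold)
            if group is None:
                continue
            if group not in handles:
                handles[group] = open(f"{prefix}.{group}.paf", "w")
            handles[group].write(line)
    for handle in handles.values():
        handle.close()

For example, split_paf_by_group('gapless_reads.paf', group_of, 'subgenome') would yield one subgenome.<label>.paf per group, each feeding its own gapless.py scaffold run.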

To mask repeats, you would need to remove them from the repeat mapping file, but I hope the memory requirements have now dropped enough that this is no longer necessary.