hillerlab / TOGA

TOGA (Tool to infer Orthologs from Genome Alignments): implements a novel paradigm to infer orthologous genes. TOGA integrates gene annotation, inferring orthologs and classifying genes as intact or lost.
MIT License

twobit sizes do not match #160

Open lpnunez opened 1 month ago

lpnunez commented 1 month ago

Hello,

I am trying to run TOGA to transfer annotations over from my well-annotated reference to several different query genomes. However, when running TOGA on certain genomes, I run into the following error:

Found 365 sequences in /home/lnunez/mendel-nas1/WGS/Cactus/Outputs/Diss/twobit/Thamnophis_elegans.2bit
Error! 2bit file: /home/lnunez/mendel-nas1/WGS/Cactus/Outputs/Diss/twobit/Thamnophis_elegans.2bit; chain_file: /home/lnunez/mendel-nas1/WGS/TOGA/Dissertation/Natrix_natrix/CM020096/temp/genome_alignment.chain Chromosome: WNA01000062.1; Sizes don't match! Size in twobit: 71766; size in chain: 71276
Traceback (most recent call last):
  File "/home/lnunez/mendel-nas1/TOGA/toga.py", line 1600, in <module>
    main()
  File "/home/lnunez/mendel-nas1/TOGA/toga.py", line 1595, in main
    toga_manager = Toga(args)
  File "/home/lnunez/mendel-nas1/TOGA/toga.py", line 261, in __init__
    self.check_param_files()
  File "/home/lnunez/mendel-nas1/TOGA/toga.py", line 338, in check_param_files
    TogaSanityChecker.check_2bit_file_completeness(self.t_2bit, t_chrom_to_size, self.chain_file)
  File "/mendel-nas1/lnunez/TOGA/modules/toga_sanity_checks.py", line 105, in check_2bit_file_completeness
    raise ValueError(err)
ValueError: Error! 2bit file: /home/lnunez/mendel-nas1/WGS/Cactus/Outputs/Diss/twobit/Thamnophis_elegans.2bit; chain_file: /home/lnunez/mendel-nas1/WGS/TOGA/Dissertation/Natrix_natrix/CM020096/temp/genome_alignment.chain Chromosome: WNA01000062.1; Sizes don't match! Size in twobit: 71766; size in chain: 71276

WNA01000062.1 is one of 347 unplaced scaffolds in the reference. However, I am only interested in the 18 actual reference chromosomes. At first, I used the --limit_to_ref_chrom option to restrict the runs to these specific chromosomes, like so:

./toga.py "${path_to_chain}"/"${genome}.chain.gz" ${path_to_bed} "${path_to_2bit}"/"${ref}.2bit" "${path_to_2bit}"/"${genome}.2bit" --limit_to_ref_chrom ${chromosome} --kt --pn /home/lnunez/mendel-nas1/WGS/TOGA/Dissertation/"${genome}"/"${chromosome}" --nc ${path_to_nextflow_config_dir} --cb 10,100 --cjn 500

However, I still get the same error, despite limiting the run to that chromosome. Is there a way to bypass this particular step that I am not seeing? I am in a time crunch, so I would greatly prefer not to regenerate the input files from scratch.

MichaelHiller commented 1 month ago

Hi,

can you pls check why this scaffold has different sizes in the 2bit and the chain (size in twobit: 71766; size in chain: 71276)? Was a different 2bit used for the chain alignments? This is certainly an issue with your input.
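
In case it is useful, here is a minimal, untested sketch (my suggestion, not part of TOGA; the paths are placeholders) that lists every scaffold whose size in the reference 2bit disagrees with the size recorded in the chain headers. It uses the twobitreader package that already shows up in your tracebacks and assumes the reference genome is the chain target, as in TOGA's own sanity check:

import gzip
from twobitreader import TwoBitFile

two_bit_path = "Thamnophis_elegans.2bit"      # reference 2bit (placeholder path)
chain_path = "genome_alignment.chain.gz"      # chain file, plain or gzipped

two_bit_sizes = TwoBitFile(two_bit_path).sequence_sizes()

opener = gzip.open if chain_path.endswith(".gz") else open
mismatches = {}
with opener(chain_path, "rt") as chains:
    for line in chains:
        if not line.startswith("chain"):
            continue
        # chain header: chain score tName tSize tStart tEnd tStrand qName qSize qStart qEnd qStrand id
        fields = line.split()
        t_name, t_size = fields[2], int(fields[3])
        if t_name in two_bit_sizes and two_bit_sizes[t_name] != t_size:
            mismatches[t_name] = (two_bit_sizes[t_name], t_size)

for name, (in_2bit, in_chain) in sorted(mismatches.items()):
    print(f"{name}\t2bit={in_2bit}\tchain={in_chain}")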

I don't know what the --limit_to_ref_chrom parameter does, but maybe try to remove the problematic scaffolds (if you don't need them) from the 2bit and the chain file. Then it should work.
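
For removing the problematic scaffolds from the chain file, a rough sketch like the one below (again untested; the blocklist and paths are placeholders) drops every chain whose reference-side scaffold is on the blocklist. The 2bit can then be rebuilt from a FASTA that simply omits those scaffolds (e.g., with UCSC faToTwoBit):

import gzip

bad_scaffolds = {"WNA01000062.1"}  # scaffolds with conflicting sizes (placeholder list)

def filter_chain(in_path, out_path, bad):
    """Copy a chain file, skipping every chain block whose target (reference)
    sequence name is in `bad`. A block starts at a 'chain' header line and
    runs until the next header."""
    opener = gzip.open if in_path.endswith(".gz") else open
    skip = False
    with opener(in_path, "rt") as fin, open(out_path, "w") as fout:
        for line in fin:
            if line.startswith("chain"):
                t_name = line.split()[2]   # tName is the 3rd field of the header
                skip = t_name in bad
            if not skip:
                fout.write(line)

filter_chain("genome_alignment.chain.gz", "genome_alignment.filtered.chain", bad_scaffolds)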

lpnunez commented 1 month ago

Hello,

Thank you very much for the response. The 2bit and chain files were both generated from a HAL file of 61 genomes produced by Cactus, so the scaffold lengths should match, and it is not clear why they differ; the main chromosome scaffolds all have matching lengths. I have already removed the unplaced scaffolds from the fastas pulled from the HAL file and am generating new chain files with make_lastz_chains.py to see whether that works.

However, I ran into another issue when running TOGA on a genome where the 2bit sizes did match: the CESAR jobs kept crashing and the run errored out. I found a similar report here: https://github.com/hillerlab/TOGA/issues/146

Following the suggestions in that issue, I added the --ms option and deactivated the "die_if_sc_1=True" flag in parallel_managers_helpers.py, but the run still errored out. I then checked the cesar_jobs_crashed.txt file to see which jobs were problematic and ran them directly. When I did so, I got the following error:

/home/lnunez/mendel-nas1/TOGA/CESAR_wrapper.py XM_032224702.1 10 /home/lnunez/mendel-nas1/WGS/TOGA/Dissertation/Seminatrix_pygaea/temp/toga_filt_ref_annot.hdf5 /home/lnunez/mendel-nas1/WGS/TOGA/Dissertation/Seminatrix_pygaea/temp/genome_alignment.bst /home/lnunez/mendel-nas1/WGS/Cactus/Outputs/Diss/twobit/Thamnophis_elegans.2bit /home/lnunez/mendel-nas1/WGS/Cactus/Outputs/Diss/twobit/Seminatrix_pygaea.2bit --cesar_binary /home/lnunez/mendel-nas1/TOGA/CESAR2.0/cesar --uhq_flank 50 --temp_dir /home/lnunez/mendel-nas1/WGS/TOGA/Dissertation/Seminatrix_pygaea/temp/cesar_temp_files --mask_stops --check_loss --alt_frame_del --memlim 10

Traceback (most recent call last):
  File "/home/lnunez/mendel-nas1/TOGA/CESAR_wrapper.py", line 2661, in <module>
    realign_exons(cmd_args)
  File "/home/lnunez/mendel-nas1/TOGA/CESAR_wrapper.py", line 2294, in realign_exons
    exon_coordinates, exon_sequences, s_sites, exon_flanks = get_exons(
  File "/home/lnunez/mendel-nas1/TOGA/CESAR_wrapper.py", line 425, in get_exons
    left_flank = chrom_seq[left_flank_coord[0]: left_flank_coord[1]].upper()
  File "/home/lnunez/mendel-nas1/miniconda3/lib/python3.9/site-packages/twobitreader/__init__.py", line 432, in __getitem__
    return self.get_slice(min_=slice_or_key.start, max_=slice_or_key.stop)
  File "/home/lnunez/mendel-nas1/miniconda3/lib/python3.9/site-packages/twobitreader/__init__.py", line 511, in get_slice
    fourbyte_dna.fromfile(file_handle, blocks_to_read)
ValueError: negative count

So the realign_exons() function in CESAR_wrapper.py ends up requesting a slice that triggers a "negative count" ValueError in twobitreader, and I am not sure how to proceed from here. For further background, I am running these jobs through SLURM and allotting 100 GB of memory per job; these are snake genomes, each about 1.5 GB in file size. I am running TOGA with the following line:

./toga.py "${path_to_chain}"/"${genome}.chain.gz" ${path_to_bed} "${path_to_twobit}"/"${ref}.2bit" "${path_to_twobit}"/"${genome}.2bit" --kt --ms --pn "${out_dir}"/"${genome}" --nc ${path_to_nextflow_config_dir} --cb 10,100 --cjn 500

MichaelHiller commented 1 month ago

Hmm, 100 GB should be enough. Do all jobs error, or only particular ones with very big genes (I think TTN needs more mem)?

The other thing that could be an issue, though I don't see it in these messages, is the .1 in the transcript ID. TOGA projections are named transcript.gene.chainID, so the XM*.1 may confuse TOGA. @kirilenkobm I think we check at an early stage whether the transcripts contain a dot? Do we?

lpnunez commented 1 month ago

I do not believe the dot in the transcript name is the issue; I have run jobs with the same reference that finished smoothly.

Also, I realized that I switched the order of the reference and the query 2bit files in the toga command (incredibly embarrassing on my end, but it would explain why the scaffold sizes did not match between the 2bit and chain files). Could this also be behind the runs where the CESAR jobs keep crashing? The query genomes there have the same scaffold sizes as the reference, so if I swapped the two 2bit files, the sanity check would not catch it.
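
To rule this out for the remaining runs, I put together a quick throwaway check (paths are placeholders) that reports what fraction of chain headers agree with each 2bit; the reference 2bit should match the chain's target side and the query 2bit the query side, both close to 1.0 if the arguments are in the right order:

import gzip
from twobitreader import TwoBitFile

def fraction_matching(two_bit_path, chain_path, side="target"):
    """Fraction of chain headers whose name/size on the given side agree with the 2bit.
    side='target' reads tName/tSize (fields 2-3); side='query' reads qName/qSize (fields 7-8)."""
    name_i, size_i = (2, 3) if side == "target" else (7, 8)
    sizes = TwoBitFile(two_bit_path).sequence_sizes()
    opener = gzip.open if chain_path.endswith(".gz") else open
    total = matching = 0
    with opener(chain_path, "rt") as fh:
        for line in fh:
            if not line.startswith("chain"):
                continue
            fields = line.split()
            total += 1
            if sizes.get(fields[name_i]) == int(fields[size_i]):
                matching += 1
    return matching / total if total else 0.0

print("reference vs chain target:", fraction_matching("Thamnophis_elegans.2bit", "genome_alignment.chain.gz", "target"))
print("query vs chain query:     ", fraction_matching("Seminatrix_pygaea.2bit", "genome_alignment.chain.gz", "query"))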

lpnunez commented 1 month ago

I wanted to follow up now that I have results from more attempted jobs. Every run I have attempted has failed with the same error mentioned in my second comment, reprinted here:

/home/lnunez/mendel-nas1/TOGA/CESAR_wrapper.py XM_032237228.1 5 /home/lnunez/mendel-nas1/WGS/TOGA/Dissertation/Nerodia_clarkii/temp/toga_filt_ref_annot.hdf5 /home/lnunez/mendel-nas1/WGS/TOGA/Dissertation/Nerodia_clarkii/temp/genome_alignment.bst /mendel-nas1/lnunez/WGS/Cactus/Outputs/Diss/twobit/Nerodia_clarkii.2bit /mendel-nas1/lnunez/WGS/Cactus/Outputs/Diss/twobit/Thamnophis_elegans.2bit --cesar_binary /home/lnunez/mendel-nas1/TOGA/CESAR2.0/cesar --uhq_flank 50 --temp_dir /home/lnunez/mendel-nas1/WGS/TOGA/Dissertation/Nerodia_clarkii/temp/cesar_temp_files --mask_stops --check_loss --alt_frame_del --memlim 10

CESAR JOB FAILURE   
Traceback (most recent call last):
  File "/home/lnunez/mendel-nas1/TOGA/CESAR_wrapper.py", line 2661, in <module>
    realign_exons(cmd_args)
  File "/home/lnunez/mendel-nas1/TOGA/CESAR_wrapper.py", line 2294, in realign_exons
    exon_coordinates, exon_sequences, s_sites, exon_flanks = get_exons(
  File "/home/lnunez/mendel-nas1/TOGA/CESAR_wrapper.py", line 425, in get_exons
    left_flank = chrom_seq[left_flank_coord[0]: left_flank_coord[1]].upper()
  File "/home/lnunez/mendel-nas1/miniconda3/lib/python3.9/site-packages/twobitreader/__init__.py", line 432, in __getitem__
    return self.get_slice(min_=slice_or_key.start, max_=slice_or_key.stop)
  File "/home/lnunez/mendel-nas1/miniconda3/lib/python3.9/site-packages/twobitreader/__init__.py", line 511, in get_slice
    fourbyte_dna.fromfile(file_handle, blocks_to_read)
ValueError: negative count

Again, the issue seems to lie in the realign_exons() function of CESAR_wrapper.py. The same error has appeared in all of my runs so far, and there is no consistency in which transcripts the jobs fail on. Only 8-12 of the thousands of transcripts fail, but since the set is not consistent across genomes, I can't simply remove the problematic transcripts from the inputs in advance.

To provide some more context: all of the input files are derived from a HAL file of 61 genomes, except the bed file, which comes from the reference genome on NCBI. I previously tested TOGA on genomes from a much smaller HAL file of four genomes, and all of those jobs succeeded. However, the chain files derived from that HAL were noticeably smaller than the ones derived from the 61-genome HAL. For instance, the Nerodia clarkii genome shown here was part of that smaller alignment, and that job finished without issue, unlike this one.

Is it possible to simply skip over the problematic transcripts when the CESAR jobs fail? Or is this indicative of a larger problem with my data?

MichaelHiller commented 1 month ago

Interesting. Typically, such issues affect many scaffolds and thus many genes. But from what you describe, it seems like only a few small scaffolds are affected.

There must be a way to ignore the crashed CESAR jobs (typically not recommended, but maybe OK here). @kirilenkobm can you pls advise how to continue downstream?

kirilenkobm commented 1 month ago

Hi @lpnunez

That sounds interesting. Could you please grep XM_032237228.1 and any other problematic transcripts from the reference BED file? Maybe I can spot something unusual.

I'd recommend skipping such transcripts, as long as they are not vital for your downstream research and affect only a small fraction of the whole dataset.
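
If it helps, a quick way to do the skipping outside of TOGA (just a throwaway sketch, not a built-in option; the file names are placeholders) is to filter the reference BED by transcript ID, since the ID that the CESAR commands report is BED column 4:

with open("bad_transcripts.txt") as ids:          # one transcript ID per line
    bad_ids = {line.strip() for line in ids if line.strip()}

with open("toga_ref_annotation.bed") as src, open("toga_ref_annotation.filtered.bed", "w") as dst:
    for line in src:
        fields = line.rstrip("\n").split("\t")
        # keep every record whose name column (BED field 4) is not blocklisted
        if len(fields) >= 4 and fields[3] in bad_ids:
            continue
        dst.write(line)

The filtered BED can then be passed to toga.py in place of the original annotation.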

lpnunez commented 1 month ago

Sure, here are a few transcripts that keep crashing:

CM020106.1 70392312 70459950 XM_032226866.1 0 - 70394527 70459807 255,0,0 17 2377,106,138,71,123,191,111,122,149,188,145,191,124,170,69,264,300 0,2525,4309,4819,7569,10314,11744,12542,12767,15403,16733,20055,22919,23411,29369,41687,67338
CM020099.1 141063852 141200299 XM_032216150.1 0 - 141063852 141153535 255,0,0 23 57,240,204,129,38,196,135,204,168,82,19,72,92,58,53,19,77,50,50,49,72,244,333 0,45437,47414,52764,54679,54719,56771,58372,61032,65269,66843,66864,69483,71042,71102,74811,78295,79660,80778,80830,84345,89551,136114
CM020106.1 70610074 70666620 XM_032226988.1 0 - 70610089 70666482 255,0,0 11 20,100,272,112,89,70,131,90,162,172,348 0,38824,39807,40999,41552,42314,44331,45776,46340,48588,56198
CM020113.1 139601986 139624066 XM_032237228.1 0 + 139602038 139623976 255,0,0 10 112,141,190,144,86,165,105,163,81,194 0,4266,6075,7928,10150,10340,12862,17462,19141,21886
CM020106.1 70646882 70666619 XM_032226987.1 0 - 70647621 70666482 255,0,0 11 924,100,272,112,89,70,131,90,162,172,347 0,2016,2999,4191,4744,5506,7523,8968,9532,11780,19390
CM020103.1 80422091 80429632 XM_032222362.1 0 - 80422160 80429632 255,0,0 3 399,140,1126 0,876,6415
CM020110.1 46982118 47036024 XM_032231700.1 0 + 46982118 47036024 255,0,0 33 48,69,67,80,93,178,134,126,98,141,64,122,178,148,148,139,201,93,95,171,103,110,97,99,183,204,90,114,135,123,59,232,189 0,231,1400,4313,6830,8408,9755,10005,12194,13571,14812,15013,15876,17361,18409,21261,22349,23341,28403,29081,31642,33920,35030,35615,35834,36453,36913,41402,43331,44951,47736,49608,53717
CM020111.1 36957766 37144457 XM_032232823.1 0 + 36957766 37144457 255,0,0 49 287,106,140,92,153,185,104,116,57,90,210,113,113,132,73,118,68,107,63,834,218,130,130,60,107,12,97,161,138,202,165,111,84,6,117,66,92,151,128,97,109,105,107,119,138,115,202,203,540 0,7176,8120,39792,40958,42029,57307,63003,65115,67408,70324,77416,81407,83803,84108,86136,89353,90396,93735,97954,100339,101213,102699,105850,106515,109859,113602,114929,116467,117398,119016,119291,153463,156467,157963,158963,160573,161512,162645,169170,173694,173922,174322,175919,177711,179085,182049,183419,186151

I went through the list of failed jobs and found that the transcripts listed here are the same across most of the ingroup taxa. The outgroup has a different set of failed transcripts, and at least one ingroup taxon has yet another set of problematic transcripts. I decided to remove these transcripts from the bed file and see whether that works for the rest of the taxa (they are 8 transcripts out of tens of thousands, so I am not too concerned about losing them).

However, even with this consistency across most ingroup taxa, it is a bit irksome to have to run a job and watch it fail just to figure out which transcripts are problematic. Is there a way to check beforehand whether a CESAR job will fail, to save time? Or to ignore those jobs and continue with the rest of the run? I would understand if no option for this has been implemented, though.
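
In the meantime, I have been pulling the failing transcript IDs out of cesar_jobs_crashed.txt with a quick snippet like the one below (assuming each line of that file stores the full failed command, as in the examples above) and excluding them from the BED for the next attempt:

crashed_file = "cesar_jobs_crashed.txt"  # placeholder path to the crashed-jobs file in the run directory

bad_ids = set()
with open(crashed_file) as fh:
    for line in fh:
        parts = line.split()
        # the transcript ID is the first argument after the CESAR_wrapper.py path,
        # e.g. ".../CESAR_wrapper.py XM_032237228.1 5 ..."
        idx = next((i for i, p in enumerate(parts) if p.endswith("CESAR_wrapper.py")), None)
        if idx is not None and idx + 1 < len(parts):
            bad_ids.add(parts[idx + 1])

print("\n".join(sorted(bad_ids)))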

kirilenkobm commented 1 month ago

Thanks a lot, @lpnunez

Actually, some filters for problematic transcripts have already been implemented, but the number of potential edge cases is simply vast.

I've tried to find anything suspicious about these particular transcripts, but so far I haven't had any luck.

If it's OK with you, could you please send me the files used in this command?

/home/lnunez/mendel-nas1/TOGA/CESAR_wrapper.py XM_032237228.1 5 /home/lnunez/mendel-nas1/WGS/TOGA/Dissertation/Nerodia_clarkii/temp/toga_filt_ref_annot.hdf5 /home/lnunez/mendel-nas1/WGS/TOGA/Dissertation/Nerodia_clarkii/temp/genome_alignment.bst /mendel-nas1/lnunez/WGS/Cactus/Outputs/Diss/twobit/Nerodia_clarkii.2bit /mendel-nas1/lnunez/WGS/Cactus/Outputs/Diss/twobit/Thamnophis_elegans.2bit --cesar_binary /home/lnunez/mendel-nas1/TOGA/CESAR2.0/cesar --uhq_flank 50 --temp_dir /home/lnunez/mendel-nas1/WGS/TOGA/Dissertation/Nerodia_clarkii/temp/cesar_temp_files --mask_stops --check_loss --alt_frame_del --memlim 10

I can see which line crashes, but I'm not sure what is causing this condition. Despite having run TOGA on hundreds of different genomes, this is something new.

lpnunez commented 1 month ago

Sure, here is a Google Drive link for the input files: https://drive.google.com/drive/folders/1i2AZr0zogw4SVFYqICpNe4MBLPkrZNHe?usp=sharing

The files here are for a different taxon, but the error condition is the same as in the command you highlighted. Here is the command for these files:

/home/lnunez/mendel-nas1/TOGA/CESAR_wrapper.py XM_032237228.1 5 /home/lnunez/mendel-nas1/WGS/TOGA/Dissertation/Regina_grahamii/temp/toga_filt_ref_annot.hdf5 /home/lnunez/mendel-nas1/WGS/TOGA/Dissertation/Regina_grahamii/temp/genome_alignment.bst /mendel-nas1/lnunez/WGS/Cactus/Outputs/Diss/twobit/Regina_grahamii.2bit /mendel-nas1/lnunez/WGS/Cactus/Outputs/Diss/twobit/Thamnophis_elegans.2bit --cesar_binary /home/lnunez/mendel-nas1/TOGA/CESAR2.0/cesar --uhq_flank 50 --temp_dir /home/lnunez/mendel-nas1/WGS/TOGA/Dissertation/Regina_grahamii/temp/cesar_temp_files --mask_stops --check_loss --alt_frame_del --memlim 10

I have been removing the problematic transcripts from the input bed files, and that seems to have fixed the issue. But it would be nice to know why these transcripts keep failing, in case there is a better fix. Thanks again for the help.