mhalushka / miRge3.0

Comprehensive analysis of small RNA sequencing data
MIT License
28 stars 11 forks source link

Ambiguous alignment handling #74

Closed Glfrey closed 1 year ago

Glfrey commented 1 year ago

Hi @arunhpatil,

I don't expect this particular pull request to be merged but figure this is best way to discuss what I've implemented so far, and why.

To handle the ambiguous alignment problem I've had to make one major change to miRges operations, which is using dictionary of dictionaries to store the output of the bowtie alignment, before processing into the data frame. This dictionary has the following format: {sequence:{aligned_mol_name_0: full sam alignment information, ..., aligned_mol_name_n: full sam alignment information}}.

The reason for this is because iterating through the alignment line by line and storing the results in the data frame directly means we loose the alignment information for the aligned first molecule unless we want to iterate through the output multiple times (which is an option). I thought carefully about whether a dictionary was the best option given that they can become quite hefty to store in RAM, which adds to the computational burden for users, but for now I think it's the most straightforward for me to implement. We can swap this strategy out for another, more RAM-friendly, alternative if preferred bearing in mind the unavoidable trade-off of time and/or I/O/diskspace costs.

Once the dictionary is generated, I then iterate through the keys (sequences) and for single mappers, simply populate the data frame as was being done previously. For multi mappers I check which alignment has the highest scoring MAPQ and enter that entry into the data frame. For instances where there's a tie for the high scoring alignment, the first alignment that was iterated through is recorded in the data frame with the addition of the "" character I mentioned. I know you said you wanted to hold off that, I included it only so I could double check everything was running as I intended (check that sequences with "" have an entry in the secondary alignment file) and we can remove this as you wish. Multi-mapping alignments are stored in an iteration-specific file e.g. mature_tRNA_secondary_matches.csv and the number of multi-mappers, and the portion of the unique reads this makes up is reported.

On that note, I have downloaded some public data from the ENA, study accession ERP112400 and ran samples ERR2938231, ERR2938232, ERR2938234, ERR2938235 through this new code. I have found that similarly to my data, only tRNAs are reporting multiple equally valid alignments and that proportion of multi-mapping unique reads is very high for mature tRNAs (from 62% (public)-78% (my data)) and lower for primary tRNAs (4.8% (public) - 17.6 (my data)). I'm not highly experienced with tRNAs but I wonder if this impacts the files output when using the 'trf flag (i.e. tRF.Counts.csv, tRF.RP100K.csv, tRFs.aligned.report.tsv)? Incidently, I noticed when using this flag the directory tRFs.samples.tmp isn't being deleted, which I guess it should given the ".tmp" extension?

This doesn't help our quest to identify how a multi-mapper might be handled for miRNAs. My guess is that multi mapping miRNAs are most likely to map to sequences from the same family given their highly similar sequence which means the reported counts in miR.Counts.csv and miR.RPM.csv would still be valid. If we wanted to be extra cautious we could implement a subroutine to make sure all the reported alignments for a unique read do, indeed, fall into set of miRNAs with merged names?

mhalushka commented 1 year ago

Thank you for this insightful work. This is very interesting. I am not at all surprised about the tRNA data having multi-mapping reads. That is expected due to the tRNA families scattered throughout the genome. We had spent a considerable amount of time coming up with the best strategy to report collapsed tRNA results and report them as combined "Comb" in the tRF.Counts and tRF.RP100K. It would be interesting to learn how these final numbers change due to multimapping or if they are relatively stable. Do you have any sense of the multimapping impact on miRNAs? I again suspect that since we collapse similar miRNAs together, this is negligible, but perhaps I am mistaken. In building miRge, we were most interested in counting the species in the RNA-seq data. Thus we worked hard to capture isomiRs. If one is more interested in knowing from which genetic loci the read came from, miRge would not be the best approach and an alignment tool that does not allow mismatches and that aligns directly to the genome would be a better approach (but at the cost of a lot of isomiR data).

Glfrey commented 1 year ago

Hi @mhalushka,

For tRNAs my suspicion is that, despite the multi-mapping, the counts are likely to remain stable as it seems they map to highly similar sequences that you've probably already considering in your collapsing. I'm happy to send you a link to the output of a mirge run on the public data if you'd like to analyse the impact the multi-mapping reads might have on tRNA counts?

For miRNAs I have yet to identify any multi-mappers, although I have so far only been able to test it on 6 samples of sequencing data (4 public, 2 of my own) so I wouldn't say it's in anyway representative. I do think it's possible to assess the risk of inter-family multi-mappers by determining if any families of miRNAs overlap in sequence content given the alignment parameters. For example, discarding the first base and final 2 bases and allowing an edit distance of 2 (which aligns with bowties parameters of -5 1 -3 2 -v 2 -f --norc --best -S --threads , do we return any miRNAs that are not within the same family which is collapsed for count purposes? If so, we have the potential for a multi-mapping problem not yet dealt with via family-based collapsing of counts, if not, great. How to go about about actually testing that without an exhaustive all-against-all search I haven't figured out yet.

I completely agree with miRges aim to capture miRNA species and isomiRs and I think it's done brilliantly. My concern with multi-mappers (and also collapsed families, something I raised before) stems from a validation viewpoint. For multi mappers, it's a problem because I haven't actually been able to definitively annotate an miRNA with its name, instead I have a set of possibilities. From a validation point of view that's not impossible to work with, but it is important to know. For collapsed miRNAs (not part of this pull request or current issue but something I've raised before), it would be good to know which of the possibilities (collapsed names) are actually present. I have numerous instances in my data where a read is annotated with the collapsed names but actually there is only one member of the family present in the data, which makes validation in the lab much more straightforward. Given that nobody has raised the issue with collapsed families and validation before I figure it's something I should implement internally, and so I have. However I do think the multi-mapping issue is potentially much more important given how usual I've found it for people to treat bioinformatics results with absolute certainty, when if there's multi mappers involved (namely those with equally valid alignments), that certainty doesn't exist and they should be aware of them before drawing any conclusions downstream.

Either way, it might not be an issue if we can determine there's no inter-family overlap given the mismatch parameters!

arunhpatil commented 1 year ago

@Glfrey,

I will have a look at this over the weekend and get back to you.

Thank you, Arun.

Glfrey commented 1 year ago

Hi @arunhpatil,

Thank you, but please don't feel obligated to work on the weekend on my account! There's no hurry on my end, I'm making the changes I need locally.

mhalushka commented 1 year ago

@Glfrey Those are good points. Often, these questions come down to how much of the data is affected. We know miRge is not perfect at assigning very derivative isomiRs. We had iterated the parameters (-5 1 -3 2 -v 2 -f --norc --best -S --threads) through multiple rounds of testing to get the best balance of miRNA counts with some clear false positives and some clear false negatives. I suspect the multi-mapping would be <<0.1% of miRNA reads, after the miRNAs are collapsed together. Globally, this would not be meaningful, but possibly it could impact a specific miRNA count if the total counts were low. Please keep us informed of any big changes you uncover.

Glfrey commented 1 year ago

Hi both,

I had to re-sync my fork with yours as I'd made additional changes to my fork which were being reflected in this pull request which caused this pull request to close. I've now made two separate repositories to keep a private version of miRge3.0 with the changes for my own purposes and public one which contains the multi-alignment handling.

arunhpatil commented 1 year ago

Hi @Glfrey,

I will clone the version you have (forked version) on my computer and see the difference. I did go through the code and I think we now have lot more print statements, I will check this again (we may not keep all of it). However, is this the final commit?

Thank you, Arun.

Glfrey commented 1 year ago

Hi @arunhpatil,

Of course, keep only what you need. I implemented the print statements just to draw a users attention to the presence multi mappers (and so I knew how many I had).

If you could give me until end of day tomorrow to just run the code with -nmir and -miRec parameters so I know I haven't inadvertently caused problems with anything there that would be great. I've tested all of the others, but I need my AWS server to test those due to issues with my Mac. I'll also proofread my code for any dead variables I stopped using with various tests (I've spotted one already). I'll drop you a message when that's all done.

Glfrey commented 1 year ago

Hi @arunhpatil,

My data is still running, taking longer than I expected. Hopefully it'll be finished tomorrow but I'm observing no issues yet. If you need to crack on with tests feel free to do so, I can always prune the dead variables after. I don't know if it matters to you but flake8 identified a few unused variables and imports from before my additions. I won't prune those but I can share them with you if you're interested.

arunhpatil commented 1 year ago

@Glfrey,

Yes, please do share. Also, take your time (No rush).

Thank you, Arun.

Glfrey commented 1 year ago

Hi @arunhpatil,

My test finally finished and I didn't observe any issues. I've removed my dead variables and pasted the ones ID'd by flake8 below so I think we're good to go!

manifoldAlign.py:7:1: F401 'concurrent.futures' imported but unused
manifoldAlign.py:10:1: F401 'mirge.libs.miRgeEssential.UID' imported but unused
manifoldAlign.py:57:9: F841 local variable 'bwtErr' is assigned to but never used
arunhpatil commented 1 year ago

Hi @Glfrey,

Thanks for letting me know. I have cloned your commits locally, I will test them and will either suggest the changes on your fork or will edit the PR on parent branch. Please wait before I can merge the PR. (I will take little longer time to review, and as these commits are useful to you, and no immideate changes are required from us, I request you to hold on this PR).

Thank you, Arun.

Glfrey commented 1 year ago

Hi @arunhpatil,

Sure thing, let me know if you need anything else from me.

mhalushka commented 1 year ago

Thank you both for working on this!

Glfrey commented 1 year ago

Hi @arunhpatil,

Just letting you know I'll be on vacation and largely unreachable until June 19th after today, so if you have any questions for me please excuse the delay, I'll get back to you ASAP.

arunhpatil commented 1 year ago

Hi @Glfrey,

Sure, thanks for letting us know. Enjoy your vacation.

Thank you, Arun.

arunhpatil commented 1 year ago

Hi @Glfrey,

I agree with @mhalushka regarding this comment . Besides, I went through your changes, and here are my comments and I have to say, unfortunately, the changes can't be accepted and I will revert the PR. Please go through the following comments:

  1. There are 5 new files generated:
    • exact_miRNA.csv: This is not required and is not part of the "ambigous alignment", well if the user is interested in generating the same file, then the user can use the command grep "1,hsa" mapped.csv > exact_miRNA.csv
    • isomiR_alignment_info.csv: Not part of the "ambigous alignment" and I believe is 12 fold slower compared to the original alignment step (explained below)
    • isomir_miRNA.csv: This is not required and is not part of the "ambigous alignment", well if the user is interested in generating the same file, then the user can use the command grep ",,hsa" mapped.csv > isomir_miRNA.csv
    • pd_align.csv: is same as mapped.csv and is not part of the "ambigous alignment"
    • mature_tRNA_secondary_matches.tsv: This is the only step where I observed multiple mapping and this is where I would suggest to keep the top best alignment and that should be simple (as suggested here and this is where I agree with Marc's comment as well)
      1. I have created two environments (original - unchanged version and glfrey - with recent changes) and I tested at several iterations and at different sequence depths (low to high: 581,450 to 37,373,347 range, n= 12) and the main pitfall is the time taken to run both versions (n = 11) the original: 1078.303 second(s) and the glfrey version: 12096.9864 second(s). The second important thing is there are no secondary alignment in any other RNA species but tRNAs. I have appended the run log for both versions below.

General command run in bash script:

miRge3.0 -s SRR772403.fastq.gz,SRR772404.fastq.gz,SRR772405.fastq.gz,SRR772406.fastq.gz,SRR8487219.fastq.gz,SRR12021974.fastq.gz,SRR12021975.fastq.gz,SRR5127206.fastq.gz,SRR5139121.fastq.gz,SRR5689184.fastq.gz,SRR649562.fastq.gz  -a illumina -lib /mnt/d/Halushka_lab/Arun/GTF_Repeats_miRge2to3/miRge3_Lib/revised_hsa -on human -db mirbase -o ambiguous_output_dir_original

miRge3.0 -a AGATCGGAAGAGCACACGTCTGAACTCC -s SRR9856179.fastq.gz -lib /mnt/d/Halushka_lab/Arun/GTF_Repeats_miRge2to3/miRge3_Lib/revised_hsa -on human -db mirbase -o ambiguous_output_dir_original

Environment: (glfrey)

bowtie version: 1.3.1
cutadapt version: 4.4
Samtools version: 1.17
Collecting and validating input files...

miRge3.0 will process 11 out of 11 input file(s).

Cutadapt finished for file SRR772403 in 1.224 second(s)
Collapsing finished for file SRR772403 in 0.0115 second(s)

Cutadapt finished for file SRR772404 in 5.1876 second(s)
Collapsing finished for file SRR772404 in 0.2321 second(s)

Cutadapt finished for file SRR772405 in 7.8799 second(s)
Collapsing finished for file SRR772405 in 0.8461 second(s)

Cutadapt finished for file SRR772406 in 2.4071 second(s)
Collapsing finished for file SRR772406 in 0.8675 second(s)

Cutadapt finished for file SRR8487219 in 83.0853 second(s)
Collapsing finished for file SRR8487219 in 5.0017 second(s)

Cutadapt finished for file SRR12021974 in 56.0665 second(s)
Collapsing finished for file SRR12021974 in 7.0654 second(s)

Cutadapt finished for file SRR12021975 in 50.7537 second(s)
Collapsing finished for file SRR12021975 in 8.4717 second(s)

Cutadapt finished for file SRR5127206 in 33.1898 second(s)
Collapsing finished for file SRR5127206 in 14.0683 second(s)

Cutadapt finished for file SRR5139121 in 28.4602 second(s)
Collapsing finished for file SRR5139121 in 19.0013 second(s)

Cutadapt finished for file SRR5689184 in 33.4478 second(s)
Collapsing finished for file SRR5689184 in 40.4994 second(s)

Cutadapt finished for file SRR649562 in 84.9423 second(s)
Collapsing finished for file SRR649562 in 42.4472 second(s)

Matrix creation finished in 7.3851 second(s)

Data pre-processing completed in 542.7753 second(s)

Alignment in progress ...
Alignment starting for exact miRNA...
Number of molecules with valid alternate alignments for exact miRNA:0
Alignment starting for hairpin miRNA...
Number of molecules with valid alternate alignments for hairpin miRNA:0
Alignment starting for mature tRNA...
Number of molecules with valid alternate alignments for mature tRNA:79273
 This accounts for 55.5% of all aligned molecules
 See ambiguous_output_dir/miRge.2023-07-15_16-37-46/mature_tRNA_secondary_matches.csv for alignment details
Alignment starting for primary tRNA...
Number of molecules with valid alternate alignments for primary tRNA:161
 This accounts for 8.3% of all aligned molecules
 See ambiguous_output_dir/miRge.2023-07-15_16-37-46/primary_tRNA_secondary_matches.csv for alignment details
Alignment starting for snoRNA...
Number of molecules with valid alternate alignments for snoRNA:0
Alignment starting for rRNA...
Number of molecules with valid alternate alignments for rRNA:0
Alignment starting for ncrna others...
Number of molecules with valid alternate alignments for ncrna others:0
Alignment starting for mRNA...
Number of molecules with valid alternate alignments for mRNA:0
Alignment starting for isomiR miRNA...

Number of molecules with valid alternate alignments for isomiR miRNA:0
Alignment completed in 11457.8935 second(s)

Summarizing and tabulating results...
Summary completed in 73.1995 second(s)

The path to ourput directory: /mnt/d/Halushka_lab/Arun/datasets/ambiguous_output_dir/miRge.2023-07-15_16-37-46

The analysis completed in 12096.9864 second(s)

/home/apatil8/miniconda3/envs/glfrey/lib/python3.10/site-packages/Bio/pairwise2.py:278: BiopythonDeprecationWarning: Bio.pairwise2 has been deprecated, and we intend to remove it in a future release of Biopython. As an alternative, please consider using Bio.Align.PairwiseAligner as a replacement, and contact the Biopython developers if you still need the Bio.pairwise2 module.
  warnings.warn(
bowtie version: 1.3.1
cutadapt version: 4.4
Samtools version: 1.17
Collecting and validating input files...

miRge3.0 will process 1 out of 1 input file(s).

Cutadapt finished for file SRR9856179 in 51.5895 second(s)
Collapsing finished for file SRR9856179 in 0.6636 second(s)

Matrix creation finished in 0.9112 second(s)

Data pre-processing completed in 54.5996 second(s)

Alignment in progress ...
Alignment starting for exact miRNA...
Number of molecules with valid alternate alignments for exact miRNA:0
Alignment starting for hairpin miRNA...
Number of molecules with valid alternate alignments for hairpin miRNA:0
Alignment starting for mature tRNA...
Number of molecules with valid alternate alignments for mature tRNA:13865
 This accounts for 40.3% of all aligned molecules
 See ambiguous_output_dir/miRge.2023-07-15_19-59-25/mature_tRNA_secondary_matches.csv for alignment details
Alignment starting for primary tRNA...
Number of molecules with valid alternate alignments for primary tRNA:9
 This accounts for 3.6% of all aligned molecules
 See ambiguous_output_dir/miRge.2023-07-15_19-59-25/primary_tRNA_secondary_matches.csv for alignment details
Alignment starting for snoRNA...
Number of molecules with valid alternate alignments for snoRNA:0
Alignment starting for rRNA...
Number of molecules with valid alternate alignments for rRNA:0
Alignment starting for ncrna others...
Number of molecules with valid alternate alignments for ncrna others:0
Alignment starting for mRNA...
Number of molecules with valid alternate alignments for mRNA:0
Alignment starting for isomiR miRNA...
Number of molecules with valid alternate alignments for isomiR miRNA:0
Alignment completed in 2927.2198 second(s)

Summarizing and tabulating results...
Summary completed in 7.3135 second(s)

The path to ourput directory: /mnt/d/Halushka_lab/Arun/datasets/ambiguous_output_dir/miRge.2023-07-15_19-59-25

The analysis completed in 2992.7378 second(s)

Environment: (original)

bowtie version: 1.3.1
cutadapt version: 4.4
Samtools version: 1.17
Collecting and validating input files...

miRge3.0 will process 11 out of 11 input file(s).

Cutadapt finished for file SRR772403 in 1.1892 second(s)
Collapsing finished for file SRR772403 in 0.012 second(s)

Cutadapt finished for file SRR772404 in 4.9313 second(s)
Collapsing finished for file SRR772404 in 0.2166 second(s)

Cutadapt finished for file SRR772405 in 7.4472 second(s)
Collapsing finished for file SRR772405 in 0.7839 second(s)

Cutadapt finished for file SRR772406 in 2.2182 second(s)
Collapsing finished for file SRR772406 in 0.7795 second(s)

Cutadapt finished for file SRR8487219 in 78.8076 second(s)
Collapsing finished for file SRR8487219 in 5.002 second(s)

Cutadapt finished for file SRR12021974 in 55.3117 second(s)
Collapsing finished for file SRR12021974 in 7.401 second(s)

Cutadapt finished for file SRR12021975 in 49.1254 second(s)
Collapsing finished for file SRR12021975 in 8.4378 second(s)

Cutadapt finished for file SRR5127206 in 31.8337 second(s)
Collapsing finished for file SRR5127206 in 14.8211 second(s)

Cutadapt finished for file SRR5139121 in 27.8011 second(s)
Collapsing finished for file SRR5139121 in 19.7841 second(s)

Cutadapt finished for file SRR5689184 in 31.0373 second(s)
Collapsing finished for file SRR5689184 in 40.7376 second(s)

Cutadapt finished for file SRR649562 in 82.215 second(s)
Collapsing finished for file SRR649562 in 48.348 second(s)

Matrix creation finished in 6.9877 second(s)

Data pre-processing completed in 535.1613 second(s)

Alignment in progress ...
Alignment completed in 464.0276 second(s)

Summarizing and tabulating results...
Summary completed in 72.2282 second(s)

The path to ourput directory: /mnt/d/Halushka_lab/Arun/datasets/ambiguous_output_dir_original/miRge.2023-07-16_12-54-47

The analysis completed in 1078.303 second(s)

/home/apatil8/miniconda3/envs/original/lib/python3.10/site-packages/Bio/pairwise2.py:278: BiopythonDeprecationWarning: Bio.pairwise2 has been deprecated, and we intend to remove it in a future release of Biopython. As an alternative, please consider using Bio.Align.PairwiseAligner as a replacement, and contact the Biopython developers if you still need the Bio.pairwise2 module.
  warnings.warn(
bowtie version: 1.3.1
cutadapt version: 4.4
Samtools version: 1.17
Collecting and validating input files...

miRge3.0 will process 1 out of 1 input file(s).

Cutadapt finished for file SRR9856179 in 51.3802 second(s)
Collapsing finished for file SRR9856179 in 0.5438 second(s)

Matrix creation finished in 0.8638 second(s)

Data pre-processing completed in 54.3416 second(s)

Alignment in progress ...
Alignment completed in 77.9287 second(s)

Summarizing and tabulating results...
Summary completed in 7.2512 second(s)

The path to ourput directory: /mnt/d/Halushka_lab/Arun/datasets/ambiguous_output_dir_original/miRge.2023-07-16_13-12-49

The analysis completed in 140.932 second(s)

As mentioned earlier, I will revert the PR. However, I appreciate your observation on the "ambiguous alignment" which is very important.

Thank you for your continued interest in miRge3.0.

Thank you, Arun.

Glfrey commented 1 year ago

Hi @arunhpatil,

Regarding the extra files, I noticed that myself on a recent run I did. They’re just files I used for debugging that I forgot to delete and missed on my testing, apologies about that.

As for the increase in time to process, that doesn’t surprise me given the process I used to check for multiple valid alignments but I appreciate that’s undesirable especially with your observation regarding only tRNAs being affected.

I understand your desire to roll back my changes, thanks for working with me anyway.

Gill