wbazant / CORRAL

MIT License
10 stars 5 forks source link

output, reads mapped to ref #10

Open emgiles opened 4 weeks ago

emgiles commented 4 weeks ago

Hi again,

I have another question, is there a way to retain the reads that map to each reference taxon? I have looked at the various options for output-type and do not see the option to output the reads (maybe as fasta or a list of read IDs) mapped to any of the reference taxa.

wbazant commented 3 weeks ago

Have a look through the .work folder, you have the .sam files there. You can then grep out the parts of the sam you are interested in and convert them to fasta with samtools. Just be aware when you're looking at the .sam that CORRAL did multiple alignments, and to marker sequences not to the whole genomes.

If you want to just look at counts per marker you could play with marker_alignments arguments - e.g. ask for no filtering or no aggregation - and you can do that from command line for each indivudual .sam, or modify the CORRAL config - copy previous results somewhere else, modify the argument, and rerun, and you'll have the results in under a minute because Nextflow will skip the previous work.

Similarly if you find something you want to do with the reads of each file, you can actually modify CORRAL's nextflow pipeline once you find where nextflow install installed it - add a Nextflow task and rerun, and it will reuse the previous work.

emgiles commented 3 weeks ago

Hi! thanks a lot, this is helpful. What is the difference between the various subdirectories in the work folder? I.e. I see sam files in subdirectories 40, 8c, a4, e4. From a quick look, the .sams in these different folders all look identical.

wbazant commented 3 weeks ago

Nextflow makes that directory structure based on the hash of the input, so it knows what to reuse. There will be files in each folder whose names start with a . (so you need ls -a to see them) defining the job.

From a quick look, the .sams in these different folders all look identical.

They'll start identical, with the header describing the reference sequence. If they're literally all identical then it's bad news - no alignments - but maybe look at the sizes, compare the two biggest ones, and see what's at the beginning and end of each file?

emgiles commented 3 weeks ago

Thanks a lot. After taking a closer look, only one subdirectory has a .sam file that has anything in it, 8c/*.sam is 35M. So I guess this is the one I should use. It is strange to me that the other subdirectories have an alignmentStats.txt file but this subdirectory does not.

emgiles commented 2 weeks ago

Hi, I have another question! How does marker alignments deal with paired end reads? For example, if I extract the sequence identifiers from the .sam alignment of eukDB to paired-end sequences, this will give me sequences in both fastq files that aligned to the eukDB?

wbazant commented 2 weeks ago

if I extract the sequence identifiers from the .sam alignment of eukDB to paired-end sequences, this will give me sequences in both fastq files that aligned to the eukDB? Try it to see but I think yes, with a tiny exception of alignments where only one mate of the pair aligns to the reference!

CORRAL default only modifies the EukDetect config of bowtie2 to add the clustering, so we do bowtie2 --omit-sec-seq --no-discordant --no-unal -k10. So, --no-discordant is on, and --no-mixed is off, see doc for definition, and in particular you might occasionally get just one mate aligning.

The pipeline also has a read filter, there's some extra detail relating to the exception above and a tiny departure from EukDetect, I remember making the issue https://github.com/allind/EukDetect/issues/18 and experimenting with a non-default alignment filter that replicates EukDetect exactly.

I'm not sure if that's helpful or relates to what you're trying to do? I'm thinking if you want to actually do the thing you're asking about, or other custom operations involving the alignments, you might find pysam very useful. marker_alignments is mostly interested in properties of alignments and not their sequences, but you can see how it computes coverage and alignment identity here for an example.

emgiles commented 3 days ago

Hi again. Thanks, this has been helpful. I have rerun CORRAL after merging the paired ends, but now I have discovered something else that I don't understand. If I run CORRAL with only one sample (called in in.tsv), I get a .sam alignment that is vastly different in size (41G) than if I run with multiple samples (each .sam is approx 3G). Then, when I extract the reads in the alignments of course I get way different % of reads mapped to the database. Is there something I'm not following with summarizeAlignments when running multiple samples? See parameters:

REF_PATH="/full_path_to_db/fish/"

INPUT_TSV="./in.tsv"

RESOURCE_CONF="./resource_configuration.conf"

nextflow pull wbazant/CORRAL -r main

nextflow run wbazant/CORRAL -r main \ --inputPath $INPUT_TSV \ --resultDir ./results_sept2024 \ --downloadMethod local \ --unpackMethod bz2 \ --libraryLayout single \ --refdb ${REF_PATH}/fish_database \ --summarizeAlignmentsCommand marker_alignments --min-read-query-length 60 --min-taxon-num-markers 2 --min-taxon-num-reads 2 --min-taxon-better-marker-cluster-averages-ratio 1.01 --threshold-avg-match-identity-to-call-known-taxon 0.97 --threshold-num-taxa-to-call-unknown-taxon 1 --threshold-num-markers-to-call-unknown-taxon 4 --threshold-num-reads-to-call-unknown-taxon 8 --output-type taxon_all\ -c ./$RESOURCE_CONF \ -with-trace -resume

wbazant commented 2 days ago

So you started with paired end reads, you took the ones that aligned, merged them, and aligned again? I've not heard of this way of using paired end reads, and it seems like the merging can potentially lose information and introduce spurious alignments.

If I run CORRAL with only one sample (called in in.tsv), I get a .sam alignment that is vastly different in size (41G) than if I run with multiple samples (each .sam is approx 3G).

The samples shouldn't influence one another but maybe the older run has reused different parameters. You can investigate this by looking in the command files just next to the .sam files - they start with a . so go to the directory of each .sam file and do ls -a - and you'll see what exactly happened. What were the contents of in.tsv in the two runs?

Is there something I'm not following with summarizeAlignments when running multiple samples?

You don't need to change the summarizeAlignmentsCommand when running with a different number of samples, but it can be something to experiment with if you want to understand the alignments better. E.g. if you set --resultDir ./results_sept2024_markers and set --output-type marker_all you'll get information at a more granular level.

emgiles commented 2 days ago

Hi there No, for this run I merged paired end reads (bbmerge), removed adapters and quality trimmed (trim galore). Then aligned to custom database.

  1. When one sample is in in.tsv (see below), resulting .sam in work file is 41G. 10-AK-BT-LGE10775 /full_path/10-AK-BT-LGE10775_merged_trimmed.fq.gz

  2. When five samples are in in.tsv, resulting .sams are 2-4G each. cat in.tsv 8-SK-BT-LGE10779 /full_path/8-SK-BT-LGE10779_merged_trimmed.fq.gz 8-AK-HZ-LGE10790 /full_path/8-AK-HZ-LGE10790_merged_trimmed.fq.gz 8-SK-HZ-LGE10789 /full_path/8-SK-HZ-LGE10789_merged_trimmed.fq.gz 10-AK-BT-LGE10775 /full_path/10-AK-BT-LGE10775_merged_trimmed.fq.gz 10-M-ZH-LGE10783 /full_path/10-M-ZH-LGE10783_merged_trimmed.fq.gz

I also just aligned the single sample (1) to the custom database with bowtie (without CORRAL) and get a .sam of 41G. So, not sure what is going on with the runs with multiple samples in in.tsv. I have used the exact same settings (as shown in last comment above) for all runs.