show read fate - headers from fastq

motu-tool / mOTUs

motus - a tool for marker gene-based OTU (mOTU) profiling

GNU General Public License v3.0

143 stars 24 forks source link

show read fate - headers from fastq #48

Open handibles opened 4 years ago

handibles commented 4 years ago

Hey devs,

Thanks for all the work on this. I'm checking the fate of the fastq reads moving through mOTUs (i.e., which specific reads are assigned, which are not etc.), so interested in the headers from the fastq input.

These headers make up the bottom 99% of the BAM/SAM file and can be obtained as below - but could this header output be set as a flag during the pipeline? Or perhaps in the wiki, outline the general format & composition of a BAM file w.r.t to the work mOTUs is doing? Could help the next person.

## get yr headers

# generate motus profile and save out the BAM file
motus profile -s test1.fastq -I test1.bam

# convert binary BAM to scanable SAM, all reads
samtools view -h test1.bam > test1.SAM

      # compare the start of the SAM file-    
      head -100 test1.SAM

      # and the end of the file - 
      tail -100 test1.SAM

# using samtools to get all MATCHED reads
samtools view -F 4 test1.bam > test1_matched.SAM

# or using samtools to get all UNMAPPED reads    -    edit :: WON'T WORK, unmapped not recorded, see below
samtools view -f 1 test1.bam > test1_lonely.SAM

AlessioMilanese commented 4 years ago

Hi @handibles,

Thanks for your interest in mOTUs!

If I understand correctly, you are suggesting how to improve the tool/wiki.

These headers make up the bottom 99% of the BAM/SAM file

This depends on how many reads there are in the fasta file. With few reads, then yes the majority is header.

but could this header output be set as a flag during the pipeline?

It would be possible to do it, but this is a standard format (BAM or SAM). Printing only the reads would make the file invalid. It's more clean to save a correct SAM/BAM file and then visualise it with samtools (as you showed in your example).

Or perhaps in the wiki, outline the general format & composition of a BAM file w.r.t to the work mOTUs is doing? Could help the next person.

Sure, I added a comment at the end of this wiki page: https://github.com/motu-tool/mOTUs_v2/wiki/Profile-one-sample

handibles commented 4 years ago

Ha, nice!

It would be possible to do it, but this is a standard format (BAM or SAM). Printing only the reads would make the file invalid.

Agreed, the last thing we (I) need is invalid files being output into the workflow. However, sometimes it is useful to be able to (e.g.) easily export all the reads that weren't assigned to a separate fastq etc., for later consideration etc.

Anyone who is interested in scrutinising their reads further could try web-searching for samtools or e.g. see the samtool & SAM format guide for info on how to interact with investigating where reads ended up.

As you say - suggestions. Thanks again for all the diligence.

handibles commented 4 years ago

Apologies for the dread return.

Realised after some more digging that those commands don't return the outputs I expected/described. To clarify:

samtools view -f 4 test1.sam should return all reads not aligned from a given BAM/SAM file - however the mOTUs BAM/SAM file output through BWA only reports reads that successfully aligned to references. I know that I could get the unaligned by subtracting the aligned from the total reads, but is there somewhere else in the output that I could check?
samtools view -F 4 test1.sam(note the capital 'F') does return all read alignments as expected, but as it shows multiple mOTU matches for each read, I'm not sure how to determine which of these matches were ultimately selected for mOTUs' final count output.

My end goal is to determine what reference each successfully aligned read was ultimately assigned to - is there a different portion of the output that I should be looking at?

AlessioMilanese commented 4 years ago

So, we parse the result from bwa. Here is a brief description of what happen in motus map_tax (it's the first step, i.e. mapping and filtering the reads) [check bin/runBWA.py]:

we map reads (for and rev are map separately)
we add information of the lane and type of read. For example, if a read header is @read1, then the read is changed to something like read1.lane0.single
we filter reads (basically at least 75 nucleotide aligning with 97% identity, but there are some other filters)
save the file

These reads are then used to understand to which marker gene cluster they belong with motus calc_mgc (second step, check bin/map_genes_to_mOTUs.py). Note that here we select the best read from the assignments done by bwa.

Moreover, if there are paired-end data, it will select the pair with the best score. Hence, not always the read with the best score is selected. If, for example, read1_for map to MGC1 with score 75 amd MGC2 with score 65; and read2_rev map to MGC1 with score 73 and to MGC2 with score 75. Then, we will select for read1 (both for and rev) to be mapping to MGC1 (score 75+73) and not MGC2 (score 65+75). But, if you look at the reads alone, then you would chose read1_rev to MGC2.

Also, we choose all reads that map correctly, and then we distribute proportionally the multiple mappers. For example, if read1 map with score 75 to MGC1 and MGC2, then we cannot decide to who to assign it just by looking at the SAM file.

AlessioMilanese commented 4 years ago

It is not so easy to understand exactly what happen of the reads, because at certain point we abstract from the single reads and we have just read counts for MGCs, and do some counts/split on that.

That said, all the reads produce by motus map_tax or motus profile -I are used in the profiler. In 90% of the cases the best match is the one used (unless there are multiple mappers/forward and reverse decisions).

If you try to run: motus map_tax -s test1.fq | samtools view

where test1.fq.gz is attached:

test1.fq.gz

There are 3 reads (@a, @b and @unmapped_read), where @unmapped_read does not pass the filter. The result of the previous call is:

a.lane0.single  0   metaMG0002471.COG0012   197 3   177M    *   0   0   GAAGCAGCTAATTATCCTTTTGCGACAATTGATCCAAATGTCGGGCGGGTAGAAGTTCCGGATGCTCGTTTGGATAAATTAACAGAACTCATCAAACCACAGAAGAAAGTCCCAACAACATTCGAATTTACTGACATTGCTGGAATTGTGAAAGGTGCTTCAAAAGGAGAAGGACTA   hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh   NM:i:0  MD:Z:177    AS:i:177    XS:i:172
a.lane0.single  256 metaMG0002483.COG0012   197 0   177M    *   0   0   *   *   NM:i:1  MD:Z:14C162 AS:i:172
a.lane0.single  272 metaMG0002468.COG0012   959 0   177M    *   0   0   *   *   NM:i:1  MD:Z:162G14 AS:i:172
a.lane0.single  256 refMG0036809.COG0012    182 0   177M    *   0   0   *   *   NM:i:1  MD:Z:14C162 AS:i:172
a.lane0.single  272 metaMG0002487.COG0012   959 0   177M    *   0   0   *   *   NM:i:1  MD:Z:162G14 AS:i:172
a.lane0.single  256 refMG0030819.COG0012    182 0   177M    *   0   0   *   *   NM:i:4  MD:Z:107G2T8C5G51   AS:i:157
b.lane0.single  0   metaMG0002471.COG0012   817 0   158M    *   0   0   AGCTACAGAAAATGCTGAAGTTGTTATCATTTCAGCGCGTGCAGAGGAAGAAATTTCTGAATTGGACGATGAAGATAAGTCAGAATTTCTGGAAGCTATCGGCTTGACAGAATCAGGTGTGGATAAACTGACCAGAGCAGCTTATCATCTATTGGGAC  hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh  NM:i:0  MD:Z:158    AS:i:158    XS:i:158
b.lane0.single  272 metaMG0002468.COG0012   358 0   158M    *   0   0   *   *   NM:i:0  MD:Z:158    AS:i:158
b.lane0.single  256 refMG0036809.COG0012    802 0   158M    *   0   0   *   *   NM:i:0  MD:Z:158    AS:i:158
b.lane0.single  272 metaMG0002479.COG0012   358 0   158M    *   0   0   *   *   NM:i:0  MD:Z:158    AS:i:158
b.lane0.single  256 refMG0022381.COG0012    802 0   158M    *   0   0   *   *   NM:i:1  MD:Z:150G7  AS:i:153
b.lane0.single  256 metaMG0002483.COG0012   817 0   158M    *   0   0   *   *   NM:i:1  MD:Z:150G7  AS:i:153
b.lane0.single  272 metaMG0002477.COG0012   358 0   158M    *   0   0   *   *   NM:i:1  MD:Z:122A35 AS:i:153
b.lane0.single  272 metaMG0002487.COG0012   358 0   158M    *   0   0   *   *   NM:i:2  MD:Z:106C25C25  AS:i:148
b.lane0.single  272 metaMG0002492.COG0012   358 0   158M    *   0   0   *   *   NM:i:3  MD:Z:106C23A1C25    AS:i:143

I would like to make you notice that:

all read in the result (there are 2 reads) pass the filter and are used to calculate the relative abundances
read @unmapped_read is not in the file because it doesn't pass the filter
@a map uniquely, in fact it has 177M (check CIGAR strings) and NM:i:0 (edit distance to the reference). All other reads have a higher value for NM (four with NM:i:1 and one with NM:i:4)
@b map to four genes equally well (158M and NM:i:0), so it's not sure to which it will be assigned. Maybe it's a multiple mapper, but It can even be that all genes are in the same MGC, and then it's not ambiguous.

Cantalapiedra commented 2 years ago

Hi @AlessioMilanese ,

Sorry to comment in an old post, but I don't see the need to ask this separately. I am inspecting some cases in which I have many reads mapping to genes from a single MGC, but then such MGC is not counted (it is not present in the output of calc_mgc). My only guess so far is that it has something to do with multiple mappers, because among those mappings there are good quality reads, and good enough alignments (both length and identity).

If I understand correctly from some of your answers (in this one and other issues) multiple mappers are counted, distributing the counts among the different genes/MGCs. However, looking at the code, it seems that there are cases where multiple mappers are completely ignored.

map_genes_to_mOTUs.py, line 956 "#if non of the hit references have a unique mapper and more this insert should be shared by more than 3 references --> ignore it"

However, from the code and the comments alone I cannot fully understand what is happening there.

So, my question is: are there cases where multiple mapper reads are discarded? If yes, which are those cases?

Thank you.

Best, Carlos

AlessioMilanese commented 2 years ago

Hi Carlos,

can you send me one or two of these reads?

Best, alessio

AlessioMilanese commented 2 years ago

But basically, If there are more than 3 multiple mappings and there are no unique mapper, then the multiple mappers are ignored.

The idea is that if you have a read that map equally well to E. coli, E. albertii and E. fergusonii, but you don't have any unique mapper to any of these species; then it's highly probable that it is an error.

And even if you would like to count it, you would have to split the read in three and count 1/3 for each species. So you would have really low read count for these 3 species which would be removed in further analyses.

In case you have many reads that are all more than 3 multiple mappers, but there is no single mapper for that species; then something fishy is happening.

Cantalapiedra commented 2 years ago

Hi Carlos,

can you send me one or two of these reads?

Best, alessio

I uploaded all the reads (actually, the SAM mappings; 74022) mapping to the genes (17) from a single MGC (ref_mOTU_v3_03673) which is not appearing after calc_mgc.

reads_example_motus_no_mgc.sam.gz

Cantalapiedra commented 2 years ago

But basically, If there are more than 3 multiple mappings and there are no unique mapper, then the multiple mappers are ignored.

The idea is that if you have a read that map equally well to E. coli, E. albertii and E. fergusonii, but you don't have any unique mapper to any of these species; then it's highly probable that it is an error.

And even if you would like to count it, you would have to split the read in three and count 1/3 for each species. So you would have really low read count for these 3 species which would be removed in further analyses.

In case you have many reads that are all more than 3 multiple mappers, but there is no single mapper for that species; then something fishy is happening.

Thank you very much for the explanation.

EDIT: ignore the next sentences below. I checked the same read with all the mappings (not only the uploaded ones to the single MGC), and it seems to be mapping to genes from other MGCs, so I guess it is a multiple mapper also.

However, after inspecting the mappings I uploaded I am not completely sure this is the case.

For example, the read "ST-K00119:198:HHCNHBBXY:7:1101:13017:33897.lane0/1" seems to be mapping uniquely to "refMG0012003.COG0172". It is true that its pair is mapping with flag 272 to the same gene, so I am not sure if this is enough to classify it as multiple mapper...

If you could give me a clue of what is happening here it would be great.

Thank you @AlessioMilanese .

Best, Carlos

Cantalapiedra commented 2 years ago

Hi again,

From the file I uploaded before, I extracted the list of reads, and from this list I obtained all the mappings from those reads (i.e. those mapping to genes from any MGC).

reads_example_motus_no_mgc.all_mappings.sam.gz

I counted how many times I see each read, and none of them seem to be found only once (ranging from 3 to 54 mappings). So yes, this could be the case where no unique mapping is found. Now I have to wonder why is this happening xD

Best, Carlos

AlessioMilanese commented 2 years ago

I'm wondering if the problem is another one. I create a small fastq file (paired end) with the read you mentioned (ST-K00119:198:HHCNHBBXY:7:1101:13017:33897). And I run mOTUs with:

motus profile -f test1.fq -r test2.fq -I test.bam -M test.MGC -o test.motus

where I save the genes (-I), the MGCs (-M) and the mOTUs (-o).

I see then that it is counted as a multiple mapper (from the motus log):

   Total number of inserts: 1
     - Unique mappers: 0
     - Multiple mappers: 1
     - Ignored multiple mapper without unique hit: 0

Now if we look at what genes it maps to (samtools view test.bam):

read1.2.lane0/1 0   refMG0012003.COG0172    451 0   150M    *   0   0   AACCGAAAACGTTGAAGTCCGCCGCTGGGGCACACCTCGCGAATTTGATTTTGAAATCAAGGATCACGTCGATGTCGGCGGTCCCTTAGGCCTCGACTTTGACGTCGGCGCTAAGCTTGCCGGAACACGCTTCTGCTTCATGAAAGGCCA  AAFFFJJJJFJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJAFJJJAJJJJAJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJFFJJJJJJJJ-<AFJJJFJJJJJJJJJJJJJJJAFJF  NM:i:0  MD:Z:150    AS:i:150    XS:i:150
read1.2.lane0/1 256 metaMG0004572.COG0172   451 0   150M    *   0   0   *   *   NM:i:0  MD:Z:150    AS:i:150
read1.2.lane0/1 256 metaMG0004633.COG0172   451 0   150M    *   0   0   *   *   NM:i:2  MD:Z:84G2G62    AS:i:140
read1.2.lane0/1 272 metaMG0004593.COG0172   903 0   150M    *   0   0   *   *   NM:i:0  MD:Z:150    AS:i:150
read1.2.lane0/2 0   metaMG0004633.COG0172   683 0   149M    *   0   0   CCGTACATTGCTAATCAGGAAACCCTGATCGGTACCGGCCAGCTGCCTAAATTCGAAGAAGACTTGTTTGCAGCTAAGAAGGGCGGCCAGGAAGGCGAAGGCGAAATCTTCTACCTCATCCCGACTTCTGAAGTTACGCTCACAAACTC   JFJJJFJJFAAFF<FJFF<AF<FFJJA-7JJJJJJFJ<<FAA-JJJJJJJJJFJJJJJJJJJJJJJJFJJJJJJJJJJJJ<JJJJJJJJJJJJJJJFJJJJJJFFJAJ7FJFJFJJJJFJJJJFJFJFFJJJJJJAJAFJJJJJFFA-A   NM:i:0  MD:Z:149    AS:i:149    XS:i:149
read1.2.lane0/2 256 GUT_GENOME022162.COG0172    683 0   149M    *   0   0   *   *   NM:i:2  MD:Z:44T62T41   AS:i:139
read1.2.lane0/2 256 GUT_GENOME025181.COG0172    683 0   149M    *   0   0   *   *   NM:i:3  MD:Z:44T62T14A26    AS:i:134
read1.2.lane0/2 256 GUT_GENOME246333.COG0172    683 0   149M    *   0   0   *   *   NM:i:3  MD:Z:44T62T14A26    AS:i:134
read1.2.lane0/2 256 metaMG0004572.COG0172   683 0   149M    *   0   0   *   *   NM:i:0  MD:Z:149    AS:i:149
read1.2.lane0/2 256 refMG0012003.COG0172    683 0   149M    *   0   0   *   *   NM:i:0  MD:Z:149    AS:i:149
read1.2.lane0/2 272 metaMG0004577.COG0172   672 0   149M    *   0   0   *   *   NM:i:2  MD:Z:41A62A44   AS:i:139
read1.2.lane0/2 272 metaMG0004593.COG0172   672 0   149M    *   0   0   *   *   NM:i:0  MD:Z:149    AS:i:149

If I check test.MGC I see:

COG0172.mOTU.v2.0002266     0.5000000000
COG0172.mOTU.v2b.0000252    0.5000000000

It reports the MGC because even if there are not unique mappers (I put only one read), it maps to only 2 MGCs (if it would be more than 3 then it would not be reported).

But then there are no mOTUs reported:

   Warning: the relative abundance is 0 for all the mOTUs

     Number of ref-mOTUs:  0
     Number of meta-mOTUs: 0
     Number of ext-mOTUs:  0

Because there is only one MGC detected. It needs at least 3 MGCs from one mOTUs (default -g 3) to report a mOTUs.

So changing the command to -g 1 would make this mOTU appear. Call:

motus profile -f test1.fq -r test2.fq -g 1 -o test.motu

And we can check with cat test.motu | grep -v "0.0000000000":

Acinetobacter sp. [ref_mOTU_v3_03673]   0.5000000000
Burkholderiales species incertae sedis [meta_mOTU_v3_12785] 0.5000000000

Cantalapiedra commented 2 years ago

Hi @AlessioMilanese ,

Thank you very much for testing this.

When I run it with all the reads it is not reported neither with -g 1 or -g 3 (obviously, since no MGC is reported after all).

So I guess (let me know if I am wrong), that what you suggest is that the problem is this one "It reports the MGC because even if there are not unique mappers (I put only one read), it maps to only 2 MGCs (if it would be more than 3 then it would not be reported)." So the problem is that I have multiple mapping reads to genes without unique mappers, and that those genes belong to more than 2 MGCs.

If this is the case, I guess I am just losing this information, because these reads are ambiguous (map to conserved regions of the genes?), and they are impossible to assign with certain confidence. I know this is difficult to address (maybe impossible), but in this case it is not only one or more Species which are not detected, but an Order which is not present when these mappings are discarded.

I wonder whether these mappings should/could be assigned to the Order (assuming that all the mapped genes belong to such Order), even when no Species is assigned.

Thank you very much once more, for your time and patience.

Best, Carlos

Cantalapiedra commented 2 years ago

I checked whether the reads map to genes only from a single order. To do it fine I guess I should keep only the best mappings (discard those with lower identity, etc), but to make things easier I included all the hits. I obtained 2 orders though (Burkholderiales and Pseudomonadales, both from class Betaproteobacteria if I not mistaken).

I was wondering also now... if I had a sample with less sequencing depth, then in some cases (like this example we are discussing) I would detect things just because there would be no enough depth (just by chance) to have reads mapping to more than 2 MGCs?

edit: sorry, I am thinking now the last paragraph makes no sense, because we are considering individual reads here edit2: sorry again jaja but if they were individual reads, then why when you use a single read the MGC counts are reported, but when I use all the reads they are not. Then it must be that all reads together are considered for the MGC counts?

Cantalapiedra commented 2 years ago

I checked my mappings for the same read, and I have the same results as you, so I wonder why I don't get any MGC count nor detect mOTU ref_mOTU_v3_03673 (meta_mOTU_v3_12785 is detected though):

ST-K00119:198:HHCNHBBXY:7:1101:13017:33897.lane0/1      0       refMG0012003.COG0172    451     0       150M    *       0       0       AACCGAAAACGTTGAAGTCCGCCGCTGGGGCACACCTCGCGAATTTGATTTTGAAATCAAGGATCACGTCGATGTCGGCGGTCCCTTAGGCCTCGACTTTGACGTCGGCGCTAAGCTTGCCGGAACACGCTTCTGCTTCATGAAAGGCCA    AAFFFJJJJFJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJAFJJJAJJJJAJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJFFJJJJJJJJ-<AFJJJFJJJJJJJJJJJJJJJAFJF  NM:i:0  MD:Z:150        AS:i:150        XS:i:150
ST-K00119:198:HHCNHBBXY:7:1101:13017:33897.lane0/1      256     metaMG0004572.COG0172   451     0       150M    *       0       0       *       *       NM:i:0  MD:Z:150        AS:i:150
ST-K00119:198:HHCNHBBXY:7:1101:13017:33897.lane0/1      256     metaMG0004633.COG0172   451     0       150M    *       0       0       *       *       NM:i:2  MD:Z:84G2G62    AS:i:140
ST-K00119:198:HHCNHBBXY:7:1101:13017:33897.lane0/1      272     metaMG0004593.COG0172   903     0       150M    *       0       0       *       *       NM:i:0  MD:Z:150        AS:i:150
ST-K00119:198:HHCNHBBXY:7:1101:13017:33897.lane0/2      16      metaMG0004572.COG0172   683     0       149M    *       0       0       CCGTACATTGCTAATCAGGAAACCCTGATCGGTACCGGCCAGCTGCCTAAATTCGAAGAAGACTTGTTTGCAGCTAAGAAGGGCGGCCAGGAAGGCGAAGGCGAAATCTTCTACCTCATCCCGACTTCTGAAGTTACGCTCACAAACTC     JFJJJFJJFAAFF<FJFF<AF<FFJJA-7JJJJJJFJ<<FAA-JJJJJJJJJFJJJJJJJJJJJJJJFJJJJJJJJJJJJ<JJJJJJJJJJJJJJJFJJJJJJFFJAJ7FJFJFJJJJFJJJJFJFJFFJJJJJJAJAFJJJJJFFA-A   NM:i:0  MD:Z:149        AS:i:149        XS:i:149
ST-K00119:198:HHCNHBBXY:7:1101:13017:33897.lane0/2      256     metaMG0004577.COG0172   672     0       149M    *       0       0       *       *       NM:i:2  MD:Z:41A62A44   AS:i:139
ST-K00119:198:HHCNHBBXY:7:1101:13017:33897.lane0/2      256     metaMG0004593.COG0172   672     0       149M    *       0       0       *       *       NM:i:0  MD:Z:149        AS:i:149
ST-K00119:198:HHCNHBBXY:7:1101:13017:33897.lane0/2      272     GUT_GENOME022162.COG0172        683     0       149M    *       0       0       *       *       NM:i:2  MD:Z:44T62T41   AS:i:139
ST-K00119:198:HHCNHBBXY:7:1101:13017:33897.lane0/2      272     GUT_GENOME025181.COG0172        683     0       149M    *       0       0       *       *       NM:i:3  MD:Z:44T62T14A26        AS:i:134
ST-K00119:198:HHCNHBBXY:7:1101:13017:33897.lane0/2      272     GUT_GENOME246333.COG0172        683     0       149M    *       0       0       *       *       NM:i:3  MD:Z:44T62T14A26        AS:i:134
ST-K00119:198:HHCNHBBXY:7:1101:13017:33897.lane0/2      272     metaMG0004633.COG0172   683     0       149M    *       0       0       *       *       NM:i:0  MD:Z:149        AS:i:149
ST-K00119:198:HHCNHBBXY:7:1101:13017:33897.lane0/2      272     refMG0012003.COG0172    683     0       149M    *       0       0       *       *       NM:i:0  MD:Z:149        AS:i:149

AlessioMilanese commented 2 years ago

If I understand correctly, the MGC table is empty? So if you run motus profile -I test.MGC [...], then test.MGC is empty?

It seems there is a problem in the way the code is running. Can you try to run motus profile --test; the result should be:

 ------------------------------------------------------------------------------
|                               TEST mOTU profiler                             |
 ------------------------------------------------------------------------------

1-- run setup.py: done

2-- Tools and versions:
- python:    correct
- bwa:       correct
- samtools:  correct
- metaSNV:   correct

3-- Taxonomy profiling test:
- Run motus (-v 1, only error messages):
- end motus call

Check resulting profile: correct

Cantalapiedra commented 2 years ago

Hi @AlessioMilanese ,

No, the MGC file is not empty. It is just missing the results that you showed for the single read you tested. The result of motus profile --test is exactly as the one you showed.

Actually, I ran again motus calc_mgc -i SAM -o COUNTS using as SAM the file I uploaded above (which includes all mappings of the reads which map to genes from mOTU ref_mOTU_v3_03673), and these are the results:

# map_tax unknown | gene database: unknown | 100 | calc_mgc 3.0.1 -y insert.scaled_counts -l 75
unnamed sample
COG0495.mOTU.v2.0001725 2496.6341950236147
COG0215.mOTU.v2.0001782 2818.302496956346
COG0016.mOTU.v2.0001928 2486.4341914358342
COG0533.mOTU.v2.0001812 1470.2779572060317
COG0012.mOTU.v2.0002033 2134.7448621374724
COG0018.mOTU.v2.0001793 5478.124453140666
COG0541.mOTU.v2.0002154 2306.6405502276
COG0018.ext_mOTU_v3_003165      678.7739440996514
COG0172.mOTU.v2.0002266 1273.9701453371777
COG0552.mOTU.v2.0001822 1205.097204435369

If I did everything right, none of those MGCs are from mOTU ref_mOTU_v3_03673, which includes the next MGCs:

COG0012.mOTU.v2b.0000253
COG0016.mOTU.v2b.0000239
COG0018.mOTU.v2b.0000243
COG0172.mOTU.v2b.0000252
COG0215.mOTU.v2b.0003523
COG0495.mOTU.v2b.0000246
COG0533.mOTU.v2b.0000247
COG0541.mOTU.v2b.0000300
COG0552.mOTU.v2b.0000348

When I run, as you did, only the mappings from the single read above (ST-K00119:198:HHCNHBBXY:7:1101:13017:33897.lane0), I get the next results

# map_tax unknown | gene database: unknown | 100 | calc_mgc 3.0.1 -y insert.scaled_counts -l 75
unnamed sample
COG0172.mOTU.v2.0002266 0.5
COG0172.mOTU.v2b.0000252        0.5

COG0172.mOTU.v2b.0000252 belongs to mOTU ref_mOTU_v3_03673. Why this result is only present when I use a single read, and not when the same read mappings are along with mappings from other reads?

Maybe I am doing some error which is leading to this confusion, I don't know.

Thank you once more.

Best, Carlos