jtamames / SqueezeMeta

A complete pipeline for metagenomic analysis
GNU General Public License v3.0

Problem with high percentage (over 70%) of "unmapped" reads #226

Closed regisjr42 closed 3 years ago

regisjr42 commented 3 years ago

Hello, it's me again.

After finally putting SQMtools to work thanks to your help, I find myself facing some problems again (or at least I think so). As I said in my first topic here, I'm working with metatranscriptomic samples of marine sediment origin. The whole SqueezeMeta pipeline went smoothly, and SQMtools was doing fine too, but while following the manual and testing the package's functions I generated the phylum diversity graph and found that a considerable chunk of the data was labeled as unmapped, and another large portion as unclassified.

Could this be due to some poor configuration during the SqueezeMeta process? Is it somewhat normal in this sort of study (I'm kind of new to this, as my inexperience clearly shows)? Or maybe the samples themselves are just poorly sequenced (they are from public repositories and not mine)? I'm attaching the generated graph here for reference. I also tried to generate the function heatmap, and that also got me basically nothing.

[Attached image: phylum-level taxonomy plot with large Unmapped and Unclassified fractions]

fpusan commented 3 years ago

How did you assemble the transcriptomes? Did you do something similar to what's described in #206?

jtamames commented 3 years ago

Hello

My guess is that your data do not have enough coverage to assemble transcripts other than the high-abundance ones. This means that you will get very biased results. I would recommend running the analysis using sqm_reads.pl or sqm_longreads.pl instead.

Best,

regisjr42 commented 3 years ago

@fpusan I actually did not use rnaSPAdes. I saw that topic and was gearing up to test using rnaSPAdes and see if it changed anything. If I do follow up on this I will get back to you.

@jtamames I will look into this too; it seems like a good first option to see what I can get that's different. My bad for not looking thoroughly enough in the manual, as that fits the description of my issue perfectly. If I want to analyze the output afterwards with SQMtools, will the process be the same?

fpusan commented 3 years ago

Regarding sqm_reads and SQMtools, you will have to use sqmreads2tables.py instead of sqm2tables.py for creating the tables, and loadSQMlite instead of loadSQM for loading the data into R. The functionality will be a bit more limited (namely you won't be able to make subsets, but you will still be able to make taxonomic and functional plots, as well as Krona charts and pathwaytools maps).
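
For reference, a minimal R sketch of that workflow, assuming sqmreads2tables.py has already been run and written its output to a tables directory (the paths are placeholders, and the argument names follow the SQMtools help pages, so double-check them against your installed version):

library(SQMtools)
proj = loadSQMlite('/path/to/my/project/tables')            # tables-only object (SQMlite)
plotTaxonomy(proj, rank = 'phylum', count = 'percent')      # taxonomic profile
plotFunctions(proj, fun_level = 'KEGG', count = 'percent')  # functional profile
exportKrona(proj, output_name = 'project_krona.html')       # interactive Krona chart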

RitaBogorad commented 3 years ago

@regisjr42 can I please ask what percentage of mapped reads you see in your .mappingstat file in the results folder? I have a similar issue: I get ~70% of reads assigned to Unmapped when I visualize my data in SQMtools, but the mean percentage of mapped reads in the .mappingstat file is 87%.

fpusan commented 3 years ago

@RitaBogorad that should not happen. Any chance you can share the project with me?

RitaBogorad commented 3 years ago

@fpusan yes, and thank you as usual ;) for your help. I can send you the project via wetransfer; what files exactly should I attach? Or do you mean the R object?

fpusan commented 3 years ago

Wetransfer is great. Please send the results and intermediate directories, as well as the SqueezeMeta_conf.pl file.

regisjr42 commented 3 years ago

Regarding sqm_reads and SQMtools, you will have to use sqmreads2tables.py instead of sqm2tables.py for creating the tables, and loadSQMlite instead of loadSQM for loading the data into R. The functionality will be a bit more limited (namely you won't be able to make subsets, but you will still be able to make taxonomic and functional plots, as well as Krona charts and pathwaytools maps).

I finally got around to testing some of this. I generated results with sqm_reads, but the directory structure of the output seems a bit different from the normal SqueezeMeta output (there is no results folder or anything), so when I ran sqmreads2tables.py I didn't really know where to put the output. I just created the tables folder right there and tried loading it into R with loadSQMlite, but it said the folder did not seem to contain valid SqueezeMeta tables.

fpusan commented 3 years ago

That should have worked. Can you walk me through the commands you used, from running sqm_reads to trying to load the tables folder into R?

regisjr42 commented 3 years ago

Sure @fpusan I did as follows:

sqm_reads.pl -p ProjectName -s path/to/EquivFile -f /path/to/fastq -t 12

After that, it created the ProjectName folder, which had all the output files. Inside it were the .allreads, .funcog, .funkegg, .mcount and .mappingstat files, plus a folder for each group in my project with their .cogs, .kegg, .tax and other files inside.

So what I did was execute sqm2tables.py

sqm2tables.py /path/to/my/project /path/to/my/project/tables

Since the output folder did not have a 'results' folder, I just put the 'tables' folder there like this; maybe that was the problem?

When loading it into R, I did just as before and copied everything in my Project directory just as it was to my personal computer and executed loadSQMlite as follows

ProjectName = loadSQMlite('/path/to/my/project')

And what happened was:

ProjectName = loadSQMlite('/path/to/my/project')
Loading taxonomies
Error in loadSQMlite("/path/to/my/project") :
  Directory "/path/to/my/project" does not seem to contain valid SqueezeMeta tables

And that was it; I hope you have some insight into what happened here.

fpusan commented 3 years ago

Yup. In your case it should be ProjectName = loadSQMlite('/path/to/my/project/tables'), as per the documentation ;)

fpusan commented 3 years ago

The key distinction here (and maybe this is not sufficiently explained) is that loadSQM will load a complete project as generated by SqueezeMeta.pl (including ORF, contig, and bin information), while loadSQMlite will only load a directory of tables. This directory can come from different sources: either running sqm2tables.py on a "proper" SqueezeMeta project, or running sqmreads2tables.py on a project created with sqm_reads.pl or sqm_longreads.pl. In that case we load only the tables containing the taxonomic/functional profiles, either because we want to save memory (if the project was generated with SqueezeMeta.pl) or because there is no other information (if the project was generated by sqm_reads or sqm_longreads, we don't have contigs, ORFs or bins, only taxonomic/functional profiles).
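
To make the distinction concrete, here is a small R sketch (the paths are placeholders, and subsetTax is one of the subsetting functions documented in the SQMtools manual; check its arguments in your version):

library(SQMtools)
# Full project from SqueezeMeta.pl: ORFs, contigs and bins are loaded, so subsetting works
full   = loadSQM('/path/to/full/project')
proteo = subsetTax(full, rank = 'phylum', tax = 'Proteobacteria')

# Tables-only load (output of sqm2tables.py or sqmreads2tables.py): lighter, no subsetting,
# but taxonomic/functional plots still work
lite = loadSQMlite('/path/to/project/tables')
plotTaxonomy(lite, rank = 'phylum', count = 'percent')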

fpusan commented 3 years ago

@RitaBogorad regarding your dataset, I loaded it and the percentage of unmapped reads in the SQMtools plot seems consistent with the one shown in the 10.*.mappingstat file. You do have a lot of Unclassified reads, but that's a different issue. Those unclassified reads are mapping back to the assembly, but they map to contigs that could not be taxonomically classified.

regisjr42 commented 3 years ago

And it sure worked, thanks @fpusan! I feel like a fool now for not checking that. I was trying to execute it like the normal loadSQM without checking the manual first, and that's what happens when you do that.

It somewhat solved the original issue with unmapped reads; now there is just a ton of unclassified reads in their place, so I think that's some progress. I will now follow up on running the normal SqueezeMeta but with rnaSPAdes for the assembly, if only for troubleshooting and comparison.

fpusan commented 3 years ago

This indicates that the problem was not related to the assembly: the Unmapped reads that had not been assembled also cannot be classified when we look at them one by one.

RitaBogorad commented 3 years ago

@fpusan thank you for checking out my data. I am a bit surprised to see that for you the numbers match. When I load the project into SQMtools I get this plot (attached), where ~70% of reads appear to be unmapped.

The commands I use:

data = loadSQM('/path/metatranscriptome/2019/', tax_mode = "nofilter", trusted_functions_only = F, engine = "data.frame")

plotTaxonomy(data, rank='superkingdom', count='percent')

Am I interpreting the values wrong?

[Attached image: superkingdom-level taxonomy barplot]

fpusan commented 3 years ago

I see it now. The issue happens when you use the prokfilter taxonomy mode when loading. Will fix ASAP, use the default for now and it should be fine.

fpusan commented 3 years ago

Ok, I think that the problem is that in the prokfilter mode the script is treating reads mapping to a contig but not contained in any ORF as "Unmapped". Since the tracking of Unmapped reads was introduced in the latest version and prokaryotic genomes are dense (meaning that most reads mapping to a contig map also to an ORF) we had not noticed until now.

You can check that this is not the case for your data.

data = loadSQM('/path/metatranscriptome/2019/', engine = "data.table")
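# codingDensity (next line): per-sample fraction of contig-mapped reads that also fall inside predicted ORFs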
codingDensity = colSums(data$orfs$abund) / colSums(data$contigs$abund)
mean(codingDensity) # 0.4121446

So only 41% of your assembly (weighted by contig abundance) is coding for something recognizable by our prediction software (prodigal + barrnap + aragorn). Do you expect to have a lot of eukaryotes there? I think they have a lower coding density, but their ORFs are also harder to predict (we use prodigal, which is tailored for prokaryotic genes).

PS: Anyway, I have to fix that and treat those reads as Unclassified rather than Unmapped.

RitaBogorad commented 3 years ago

thank you very much for the clarification.

Our experiment indeed involves samples with single-cell eukaryote algae and a bunch of bacteria. But the last 5 samples (inside the blue rectangle in the figure) contain only bacterial cells and should not contain any eukaryote reads. The percentage of unclassified reads is indeed higher in the eukaryote-containing samples, but the difference between them and the bacterial controls is not that large.

Would you recommend re-doing the sqm run with --euk flag? Do you think it should increase the number of classified eukaryote reads?

As we are working with a non-model algae organism, I would think that the database is not yet updated with the latest annotations for our species. Do you think that adding an external genome assembly can increase the % of identified reads? For us, in this experiment, it is possible to create a new "chimera" assembly with the bacterial-only metagenome (used for this metatranscriptome) and the full algae genome, sequenced by our colleagues. Do you think that should improve the results?

Similarly, we do have access to the algal transcriptome, but if I supply it for the functional annotation of this metatranscriptome dataset, my bacteria will not be annotated. Is there a way I could add an additional functional database to an already existing one in sqm?

[Attached image: superkingdom-level taxonomy barplot; blue rectangle marks the bacteria-only control samples]

fpusan commented 3 years ago

Our experiment indeed involves samples with single-cell eukaryote algae and a bunch of bacteria. But the last 5 samples (inside the blue rectangle in the figure) contain only bacterial cells and should not contain any eukaryote reads. The percentage of unclassified reads is indeed higher in the eukaryote-containing samples, but the difference between them and the bacterial controls is not that large.

Yeah, you are right. I wonder what would happen if you ran only the last 5 samples in SqueezeMeta. Would the assembly look better? Note that you still have a percentage of eukaryotic contigs in those samples (at least according to SqueezeMeta). Aren't those samples supposed to be bacteria only? What happens if you check some of those contigs manually? (using blastn or blastx online at NCBI's site)
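
If it helps, here is a rough R sketch for pulling a few of those suspicious contigs out for a manual BLAST. It assumes the loaded SQM object (here called data, as in your example) exposes contigs$tax and contigs$seqs as described in the SQMtools documentation; check the exact rank/column names in your version:

# contigs whose superkingdom-level assignment mentions Eukaryota
euk = rownames(data$contigs$tax)[grepl('Eukaryota', data$contigs$tax[, 'superkingdom'])]

# keep the most abundant ones (summed read counts across samples)
euk = euk[order(rowSums(data$contigs$abund[euk, , drop = FALSE]), decreasing = TRUE)]
sel = head(euk, 10)

# write them to a FASTA file for blastn/blastx at NCBI
writeLines(paste0('>', sel, '\n', data$contigs$seqs[sel]), 'suspicious_contigs.fasta')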

Would you recommend re-doing the sqm run with --euk flag? Do you think it should increase the number of classified eukaryote reads?

You can try, but I don't think it will improve dramatically. You can also add the -D flag, which will try to annotate ORFs using blastx in addition to prodigal etc.

As we are working with a non-model algae organism, I would think that the database is not yet updated with the latest annotations for our species. Do you think that adding an external genome assembly can increase the % of identified reads? For us, in this experiment, it is possible to create a new "chimera" assembly with the bacterial-only metagenome (used for this metatranscriptome) and the full algae genome, sequenced by our colleagues. Do you think that should improve the results?

Yes, that might improve things. At least you won't have to worry about the assembly of the algae.

Similarly, we do have access to the algal transcriptome, but if I supply it for the functional annotation of this metatranscriptome dataset, my bacteria will not be annotated. Is there a way I could add an additional functional database to an already existing one in sqm?

You can easily add an extra functional database when running SqueezeMeta (see the ReadMe, section "6. Using external databases for functional annotation"). However, only predicted genes get functionally annotated, so it will not help for eukaryotic genes not predicted by prodigal/blastx. If you or your colleagues already have a GFF/GTF containing the gene/transcript predictions for your algae, there might be a way to hack it into the SqueezeMeta pipeline.

fpusan commented 3 years ago

@RitaBogorad issue is hopefully fixed by 689add5. This is how the barplot looks now (using the nofilter mode as you did in the example above).

[Attached image: updated superkingdom-level taxonomy barplot after the fix]

If everything goes according to plan (does it ever?), we'll release 1.3.1 by the end of this week with this and other fixes.

RitaBogorad commented 3 years ago

Thank you, Fernando, for the update! I will be looking forward to the new release!

Meanwhile, I have been trying out the suggestions from the previous post.

Yeah, you are right. I wonder what would happen if you ran only the last 5 samples in SqueezeMeta. Would the assembly look better? Note that you still have a percentage of eukaryotic contigs in those samples (at least according to SqueezeMeta). Aren't those samples supposed to be bacteria only? What happens if you check some of those contigs manually? (using blastn or blastx online at NCBI's site)

Some of the original SqueezeMeta assignments honestly made me giggle - we have trees unique to China, Peruvian chili peppers, and Brazilian plants in there. For a Belgium-based laboratory of marine biology, those are somewhat unexpected results :) BLAST could link the majority of those suspicious reads to Arabidopsis with ~96% identity. And, unfortunately, those reads are also very abundant, so filtering out low counts didn't result in a more "expected" marine bacteria signature. Would you think this is happening because our bacteria are probably not well represented in the reference databases? Or could something have gone wrong during the analysis?

As we are working with a non-model algae organism, I would think that the database is not yet updated with the latest annotations for our species. Do you think that adding an external genome assembly can increase the % of identified reads? For us, in this experiment, it is possible to create a new "chimera" assembly with the bacterial-only metagenome (used for this metatranscriptome) and the full algae genome, sequenced by our colleagues. Do you think that should improve the results?

Yes, that might improve things. At least you won't have to worry about the assembly of the algae.

I created a metagenome assembly from whole-genome Illumina sequencing of the bacterial and algae DNA only, and mapped the metatranscriptome to this assembly. Comparing the numbers: with the metatranscriptome-based assembly, ~87% of metatranscriptome reads map per sample, but with the metagenome-based assembly only ~50% map per sample. The bacterial controls in particular have a very low percentage mapped, ~20%. Could that be explained by the bacterial species that appear in my bacterial metatranscriptome controls having had low abundance in the bacterial metagenome and therefore being underrepresented in the assembly?

Those two observations lead me to the conclusion that maybe it is best for such experiments to perform two separate SQM runs, one for the "host" organism and one for the bacteria, and then analyze the results as a merged object. Would you agree with this solution?

fpusan commented 3 years ago

Some of the original SqueezeMeta assignments honestly made me giggle - we have trees unique to China, Peruvian chili peppers, and Brazilian plants in there. For a Belgium-based laboratory of marine biology, those are somewhat unexpected results :) BLAST could link the majority of those suspicious reads to Arabidopsis with ~96% identity.

In general, I wouldn't trust very low level annotations (whether the plant is peruvian, chinese or whatever) and less so for eukaryotes. SqueezeMeta is normally very conservative when assigning taxonomies, but if you are using the nofilter or prokfilter modes then it will be much less strict, particularly for eukaryotes.

In any case, the fact that blastn also marks those as Arabidopsis makes me think that the problem does not come from SqueezeMeta. Blastn and SqueezeMeta both agree on assigning those reads to a plant, so I'd say some "planty" reads are definitely present in those samples, even if they are supposed to be bacteria-only controls. I'd say that cross-contamination is the most likely explanation.

Those "planty" reads might actually belong to your model algae (which I guess would be the most likely source of contamination in your lab). I'm a bit surprised that blastn assigned them to Arabidopsis instead of to some algae. Have you sequenced your model organism? Can you check whether your "weird" contig/reads actually map to your model organism?

In any case we'll take a look at the project you sent us, see if we spot something weird with the eukaryotes and the unclassified reads.

regisjr42 commented 3 years ago

How did you assemble the transcriptomes? Did you do something similar to what's described in #206?

I decided to give rnaSPAdes a shot and see if it improved my results, but I'm facing the same problems as mentioned in that topic. rnaSPAdes is not running properly because the SPAdes call is hard-coded to --meta instead of --rna. We tried editing the 01.run_assembly.pl file and manually changing it to --rna, but it didn't seem to work. Maybe we are missing something else that should be edited so rnaSPAdes can run properly?

If this doesn't pan out I will resort to running rnaSPAdes on the side and using the -extassembly option. But I really want to make it work all inside the SqueezeMeta pipeline if possible

jtamames commented 3 years ago

Hello
Which mode are you using? If you are using merged or seqmerge, the pipeline will be calling 01.run_assembly_merged.pl instead, and the change should be made in that script. In 01.run_assembly.pl, changing the presets for SPAdes should work. In lines 78-79:

  if(-e $par2name) {
      $command="$spades_soft --meta --pe1-1 $par1name --pe1-2 $par2name -m 400 -k 21,33,55,77,99,127 $assembler_options -t $numthreads -o $datapath/spades >> $syslogfile 2>&1";
  } else {
      $command="$spades_soft --meta --s1 $par1name -m 400 -k 21,33,55,77,99,127 $assembler_options -t $numthreads -o $datapath/spades >> $syslogfile";
  }

Change --meta to --rna and remove the -k list of k-mers. Do not forget to do it in both the if and the else branches. If it doesn't work, please send me the syslog file.

Best, J

RitaBogorad commented 3 years ago

In general, I wouldn't trust very low level annotations (whether the plant is peruvian, chinese or whatever) and less so for eukaryotes. SqueezeMeta is normally very conservative when assigning taxonomies, but if you are using the nofilter or prokfilter modes then it will be much less strict, particularly for eukaryotes.

In any case, the fact that blastn also marks those as Arabidopsis makes me think that the problem does not come from SqueezeMeta. Blastn and SqueezeMeta both agree on assigning those reads to a plant, so I'd say some "planty" reads are definitely present in those samples, even if they are supposed to be bacteria-only controls. I'd say that cross-contamination is the most likely explanation.

Thank you very much for your detailed answers, I really appreciate and value your help! I used the algae genome assembly (produced by my colleagues) to see whether the "plant" reads in my metatranscriptome blast well against it or not. Some of the most abundant eukaryote reads have >95% identity (and probably came from cross-contamination), but some have a rather low 85% identity or no hits at all. The contigs are also quite short, which makes the validation even harder.

You can easily add an extra functional database when running SqueezeMeta (see the ReadMe, section "6. Using external databases for functional annotation"). However, only predicted genes get functionally annotated, so it will not help for eukaryotic genes not predicted by prodigal/blastx. If you or your colleagues already have a GFF/GTF containing the gene/transcript predictions for your algae, there might be a way to hack it into the SqueezeMeta pipeline.

I did gain access to the GFF and GTF with the transcript predictions. Could you please let me know how I can incorporate them into the SqueezeMeta pipeline?

Thank you once again for everything!

fpusan commented 3 years ago

Ok, so here's the thing.

In the SqueezeMeta pipeline, steps 1 to 3 deal with assembly, prediction of RNA coding regions and prediction of protein coding regions, respectively.

Since you already have an assembly and a GFF, you don't need to do that, but you still have to trick SqueezeMeta into thinking that it ran everything by itself. Try the following:

1) Run SqueezeMeta adding the following parameters: -extassembly genome_of_your_model_organism.fasta -test 2. This will skip the assembly and instead use the genome you already have. It will also stop the pipeline after step 2, so the hacking can begin.
2) Look at the <project_path>/results/01.<project_name>.fasta file. SqueezeMeta will have renamed your contigs. Modify your GFF file so the contig names in the GFF match those inside the <project_path>/results/01.<project_name>.fasta file.
3) Make a subset of your GFF file so that it only contains the protein coding sequences. Rename the ORFs in the GFF file so that they follow the SqueezeMeta convention contig_name_start-end (e.g. ID=Merged_6_81-545). A rough sketch of steps 2 and 3 follows below.
4) Now you need to generate 3 files:

Fun times!
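
As a rough base-R illustration of steps 2 and 3 above (not the actual files SqueezeMeta needs), here is a sketch. It assumes you can build a two-column lookup table (original_name, squeezemeta_name) relating your original contig names to the renamed ones in 01.<project_name>.fasta; the file names are hypothetical:

# hypothetical inputs: a contig-name lookup table and the original GFF
lookup = read.delim('contig_name_lookup.tsv', stringsAsFactors = FALSE)  # columns: original_name, squeezemeta_name
gff    = read.delim('algae.gff', header = FALSE, comment.char = '#', stringsAsFactors = FALSE)
colnames(gff)[c(1, 3, 4, 5)] = c('seqid', 'type', 'start', 'end')

# step 2: rename contigs so they match 01.<project_name>.fasta
gff$seqid = lookup$squeezemeta_name[match(gff$seqid, lookup$original_name)]

# step 3: keep only protein-coding features and build SqueezeMeta-style ORF names
cds = gff[gff$type == 'CDS' & !is.na(gff$seqid), ]
cds$orf_id = paste0(cds$seqid, '_', cds$start, '-', cds$end)  # e.g. Merged_6_81-545

write.table(cds, 'renamed_cds.tsv', sep = '\t', quote = FALSE, row.names = FALSE)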

regisjr42 commented 3 years ago

Hello
Which mode are you using? If you are using merged or seqmerge, the pipeline will be calling 01.run_assembly_merged.pl instead, and the change should be made in that script. In 01.run_assembly.pl, changing the presets for SPAdes should work. In lines 78-79:

  if(-e $par2name) {
      $command="$spades_soft --meta --pe1-1 $par1name --pe1-2 $par2name -m 400 -k 21,33,55,77,99,127 $assembler_options -t $numthreads -o $datapath/spades >> $syslogfile 2>&1";
  } else {
      $command="$spades_soft --meta --s1 $par1name -m 400 -k 21,33,55,77,99,127 $assembler_options -t $numthreads -o $datapath/spades >> $syslogfile";
  }

Change --meta to --rna and remove the -k list of k-mers. Do not forget to do it in both the if and the else branches. If it doesn't work, please send me the syslog file.

Best, J

I tried it again and it still didn't work. It is possible that we may have tinkered a bit too much with the 01.run_assembly.pl code.

I have zipped the syslog file and uploaded it on WeTransfer. Let me know if this is okay for you.

jtamames commented 3 years ago

Hello
If you take a quick look at the syslog, you will see SPAdes reporting this error:

0:31:48.003 2G / 9G ERROR K-mer Counting (kmer_data.cpp : 354) The reads contain too many k-mers to fit into available memory. You need approx. 217.109GB of free RAM to assemble your dataset

That's why SPAdes crashed.

Best, J

regisjr42 commented 3 years ago

Hello
If you take a quick look at the syslog, you will see SPAdes reporting this error:

0:31:48.003 2G / 9G ERROR K-mer Counting (kmer_data.cpp : 354) The reads contain too many k-mers to fit into available memory. You need approx. 217.109GB of free RAM to assemble your dataset

That's why SPAdes crashed.

Best, J

I thought 217GB was a bit insane, and in the end I think the -k parameter had something wrong with it. I tried again afterwards, and now when I look at the syslog the following error is presented:

== Error == Please specify option (e.g. -1, -2, -s, etc) for the following paths: 12

Tried looking in the manuals for something that could indicate what this means but didn't really find anything. Any guesses what this might be?

jtamames commented 3 years ago

That means that the specification for paired reads (-1 -2) or single reads (-s) is missing. Did you specify any of these options?

regisjr42 commented 3 years ago

My command is

SqueezeMeta.pl -m coassembly -p AllCtrl_spades -s /home/rantonioli/workspace/dataset/AllCtrl_equiv -f /home/rantonioli/workspace/dataset/fastq -t 16 --nobins -map bwa -a spades

My understanding at first was that the SqueezeMeta pipeline would take care of providing that information to SPAdes using the equiv file, but if that's not the case, do I need to put the necessary arguments (-1 -2 for my paired reads) in -assembly_options?

(EDIT: I had pasted the wrong command here before; now it is accurate to what I am actually using. I had tried with -assembly_options specified before and that didn't work either, but with different errors.)

jtamames commented 3 years ago

Yes, you are right. SqueezeMeta should pass the read files to SPAdes. This is weird, because the command line that was in the previous syslog was:

/home/rantonioli/anaconda3/envs/SqueezeMeta/SqueezeMeta/scripts/SqueezeMeta.pl -m coassembly -p AllCtrl_spades -s /home/rantonioli/workspace/dataset/AllCtrl_equiv -f /home/rantonioli/workspace/dataset/fastq -t 16 --nobins -map bwa -a spades

Exactly the same as the one you posted, and there SPAdes was working (and crashed because of memory issues). Could you check that everything is fine (e.g., raw reads are in place), and send me the syslog and the samples file (AllCtrl_equiv)?

regisjr42 commented 3 years ago

Raw reads and all the other stuff are all supposed to be in place. Here are the syslog and samples file for that last try.

When you talk about this

Exactly the same as the one you posted, and there SPAdes was working (and crashed because of memory issues).

The only difference between the two is that before, when it crashed due to memory issues, the -k values were all still there in the 01.run_assembly.pl file, and now they are removed.

jtamames commented 3 years ago

Hello again
I cannot reproduce this; SPAdes runs correctly in the latest 1.3.1 version. But checking your syslog, I see that the options are:

-m 400 -k -t 16

It is that empty -k argument which is crashing the program. Perhaps you didn't remove it completely from your modified command line in 01.run_assembly.pl? Could you please check?

Best, J

regisjr42 commented 3 years ago

Hey, sorry for the delay in my answers. Our server is a little busy lately, so I'm going at a slower pace with my stuff.

Now it actually started running SPAdes, but it stopped in the middle; something about the PHRED offset now. As follows:

== Error ==
system call for: "['/home/rantonioli/anaconda3/envs/SqueezeMeta/SqueezeMeta/bin/SPAdes/hammer', '/home/rantonioli/workspace/dataset/AllCtrl_spades/data/spades/corrected/configs/config.info']" finished abnormally, err code: 255

If you would like to see the syslog again you can download it here

jtamames commented 3 years ago

Hello
This is a problem with SPAdes. You can take a look at the SPAdes repo, since some users have had related problems, for instance: https://github.com/ablab/spades/issues/37

There it was solved by adding --phred-offset 33 when running SPAdes, so you can put that in the SPAdes command line.

Best, Javier

regisjr42 commented 3 years ago

Hello
This is a problem with SPAdes. You can take a look at the SPAdes repo, since some users have had related problems, for instance: ablab/spades#37

There it was solved by adding --phred-offset 33 when running SPAdes, so you can put that in the SPAdes command line.

Best, Javier

With --phred-offset 33 specified, SPAdes actually ran normally, but then we encountered a new issue. I searched for it directly in the SPAdes repo (it was the same error mentioned here and here), and it seems to come back to the memory issue again, if that's really the case.

The error message:

== Error == system call for: "['/home/rantonioli/anaconda3/envs/SqueezeMeta/SqueezeMeta/bin/SPAdes/hammer', '/home/rantonioli/workspace/dataset/AllCtrl_spades/data/spades/corrected/configs/config.info']" finished abnormally, err code: -9

To see if that was the case (I had already run a successful test SPAdes job on this same server, albeit with a much smaller sample), I tried running SqueezeMeta with rnaSPAdes again, but on a much smaller group than I was using before. SPAdes actually finished its pipeline, but then another error popped up:

Running assembly with spades
Running prinseq (Schmieder et al 2011, Bioinformatics 27(6):863-4) for selecting contigs longer than 200
ERROR: could not find input file "/home/rantonioli/workspace/dataset/Oil29_spades/data/spades/contigs.fasta".
Try 'perl prinseq-lite.pl -h' for more information.
Exit program.
Error running command: /home/rantonioli/anaconda3/envs/SqueezeMeta/SqueezeMeta/bin/prinseq-lite.pl -fasta /home/rantonioli/workspace/dataset/Oil29_spades/data/spades/contigs.fasta -min_len 200 -out_good /home/rantonioli/workspace/dataset/Oil29_spades/results/prinseq; mv /home/rantonioli/workspace/dataset/Oil29_spades/results/prinseq.fasta /home/rantonioli/workspace/dataset/Oil29_spades/results/01.Oil29_spades.fasta > /dev/null 2>&1 at /home/rantonioli/anaconda3/envs/SqueezeMeta/SqueezeMeta/scripts/01.run_assembly.pl line 131.
Stopping in STEP1 -> 01.run_assembly.pl. Program finished abnormally

I tried looking at the syslog but found nothing; it only says the SPAdes pipeline finished and the prinseq step started, so it seems the issue might be somewhere around there now. The contigs.fasta file it mentions indeed does not exist inside the folder (I checked), but it seems like something the job itself should have created.

jtamames commented 3 years ago

I am afraid that SPAdes returned no results. Take a look at the data/spades folder; you will find the SPAdes log there, and check what it says. Best,

regisjr42 commented 3 years ago

I checked it, but it seems normal to me. I will link it here if you want to check for yourself. It might be something dumb, as I am very inexperienced with this; sorry if that ends up being the case.

jtamames commented 3 years ago

It is not you, it is me lol

I have found the potential problem. See my last response in #200

Best, J

regisjr42 commented 3 years ago

I see, I think the issue is quite similar indeed.

So should I use the transcripts.fasta as a source for -extassembly or should I run SqueezeMeta with --singletons?

jtamames commented 3 years ago

You can do both, specify -extassembly and --singletons.

regisjr42 commented 3 years ago

The pipeline finished successfully using transcripts.fasta and --singletons. Now I will check in R whether it solved the original issue of the unmapped reads.

In the meantime, I also have metagenome data for these same samples, and I know SqueezeMeta has a pipeline to use both metagenomic and metatranscriptomic data at the same time, but I do not know exactly how to input both at once and haven't found clear instructions in the manual on how to do it. Should I create a new topic so we could go over this, or do we do it here?

fpusan commented 3 years ago

Hi! Maybe this will be of help? https://github.com/jtamames/SqueezeMeta/wiki/Combining-metagenomes-and-metatranscriptomes-with-SqueezeMeta

regisjr42 commented 3 years ago

Hello again! It's been a while but I had to take some time with other stuff and now I'm back working on this project.

Hi! Maybe this will be of help? https://github.com/jtamames/SqueezeMeta/wiki/Combining-metagenomes-and-metatranscriptomes-with-SqueezeMeta

Yes! That did help a lot and shed some light on this feature for us, thanks @fpusan !

Now going back to my original issues:

fpusan commented 3 years ago

Let us know how the new run goes. We are used to getting worse mapping percentages in metatranscriptomes, but any extra info you find will be helpful.

Regarding your second question, I just coded 4aadbf8, which hopefully will do what you want. It would be great if you could test it.
1) Please download the following files:

Let me know if that works!

regisjr42 commented 3 years ago

Hi, I ran SqueezeMeta again using the new files and I have some input on it. I actually ran into the same errors as before; without going into much detail again (as they were already mentioned here), I removed the -k parameter again and added --phred-offset 33 to the command line, so it went ahead.

But then I faced again the "error code: -9"

== Error == system call for: "['/home/rantonioli/anaconda3/pkgs/squeezemeta-1.3.1-pl526r36_0/SqueezeMeta/bin/SPAdes/hammer', '/home/rantonioli/workspace/dataset/Ctrl29_SM_spades/data/spades/corrected/configs/config.info']" finished abnormally, err code: -9

Still not sure about this one; it is probably related to memory, but I don't know, as I successfully made a run with a similar sample before. In the SPAdes repo I saw some comments about it being related to the version being used; the dev asked if the problem persisted in the more recent 3.14 version, and I noticed SqueezeMeta uses the 3.11 version (as seen here).

That's it for now; I'm still testing more options here and looking for a successful run, but it now seems the most plausible culprit is available memory.

fpusan commented 3 years ago

How much available memory do you have? Can you test with a smaller set of sequences, just to double-check that the --rna mode is being launched correctly?