paoloshasta / shasta

De novo assembly from Oxford Nanopore reads.
https://paoloshasta.github.io/shasta/

How to generate a de novo assembly (Meccus pallidipennis)? (Question) #44

Open mfonseca91 opened 1 day ago

mfonseca91 commented 1 day ago

Hello, do you have any suggestions for performing a de novo assembly of Meccus pallidipennis, and how should I generate the corresponding config file? I have a fastq.gz file with 5.93 million unphased reads generated with the Ligation Sequencing DNA V14 kit (SQK-LSK114-XL) on a FLO-MIN114 flow cell.

paoloshasta commented 1 day ago

We don't have an assembly configuration optimized for your situation, but I suggest using, at least as a starting point, one of the two we developed for the ULK114 kit announced by ONT in May 2024:

* If you used HERRO error correction after base calling, use `--config Nanopore-r10.4.1_e8.2-400bps_sup-Herro-Sep2024.conf`.

* If you did not use HERRO error correction and you are working with reads out of the standard base calling process, use `--config Nanopore-r10.4.1_e8.2-400bps_sup-Raw-Sep2024.conf`.

Since your reads are shorter than the ULK114 reads the results will not be optimal. But if you can post here the following output files from your assembly I may then be able to suggest tweaks for improvement:

* `AssemblySummary.html`

* `stdout.log`

Do you have an estimate of genome size for this species? Any idea of heterozygosity rate?

mfonseca91 commented 16 hours ago

> We don't have an assembly configuration optimized for your situation, but I suggest using, at least as a starting point, one of the two we developed for the ULK114 kit announced by ONT in May 2024:
>
> * If you used HERRO error correction after base calling, use `--config Nanopore-r10.4.1_e8.2-400bps_sup-Herro-Sep2024.conf`.
>
> * If you did not use HERRO error correction and you are working with reads out of the standard base calling process, use `--config Nanopore-r10.4.1_e8.2-400bps_sup-Raw-Sep2024.conf`.
>
> Since your reads are shorter than the ULK114 reads the results will not be optimal. But if you can post here the following output files from your assembly I may then be able to suggest tweaks for improvement:
>
> * `AssemblySummary.html`
>
> * `stdout.log`
>
> Do you have an estimate of genome size for this species? Any idea of heterozygosity rate?

Thanks for your response. I tried running Shasta with the aforementioned configuration files; however, they are not available among Shasta's --config options. Where can I access or download the configuration files? I would also like to add that we don't have data on the genome size or heterozygosity rate of M. pallidipennis. The closest species to M. pallidipennis is T. infestans, whose genome size is reported as around 17,301 bp.

paoloshasta commented 15 hours ago

Sorry, my directions were incorrect. You should omit the .conf portion of those assembly configuration names, because they are built-in configurations that don't require a configuration file. Make sure you are using the latest Shasta release, 0.13.0. So you should use either:

--config Nanopore-r10.4.1_e8.2-400bps_sup-Herro-Sep2024

OR

--config Nanopore-r10.4.1_e8.2-400bps_sup-Raw-Sep2024

I apologize for the confusion.
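As a quick sanity check, I believe the executable can also print the names of its built-in configurations, and a minimal run with one of them would look roughly like this (reads.fastq is a placeholder for your input file):

```bash
# Print the names of the built-in assembly configurations (no .conf file needed).
./shasta-Linux-0.13.0 --command listConfigurations

# Minimal example run with one of the built-in configurations.
./shasta-Linux-0.13.0 --input reads.fastq --config Nanopore-r10.4.1_e8.2-400bps_sup-Raw-Sep2024
```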

paoloshasta commented 15 hours ago

That genome size of 17,301 bp seems too small. Are you sure that is not the number of genes instead?

mfonseca91 commented 11 hours ago

> That genome size of 17,301 bp seems too small. Are you sure that is not the number of genes instead?

I have run the de novo assembly using the following command:

/home/mfonseca/LGC_INMEGEN/Proyecto_chinches/Shasta/shasta-Linux-0.13.0 --input /home/mfonseca/LGC_INMEGEN/Proyecto_chinches/ONT_data_chinches/FAY11728_pass.fastq --config Nanopore-r10.4.1_e8.2-400bps_sup-Raw-Sep2024 --memoryMode anonymous --threads 50

In the end, only the stdout.log file was generated (attached: stdout.log).

On the other hand, we have confirmed that no genome size or specific gene count is available for M. pallidipennis or for the closest species (T. infestans).

paoloshasta commented 11 hours ago

There are several things happening here. Let's discuss them one at a time.

It appears that your assembly process was killed while still running. Is it possible that this happened because of a memory issue or other system limit issue? Did you see a "Killed" message on the process output? That message would not make it to stdout.log. How much memory does the machine you are using have? Please post the output of the following commands:

head -1 /proc/meminfo
ulimit -a
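If you have access to the kernel log, it usually records out-of-memory kills explicitly; a quick optional check, which may require root depending on the system:

```bash
# Look for OOM-killer activity in the kernel log (may need sudo).
dmesg -T | grep -i -E 'out of memory|killed process'
```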

By the time the process was killed some other files should have been generated. Can you post here a list of the files that were created?

Your reads are very short, and the 10 Kb read length cutoff used by that assembly configuration resulted in most of your coverage being discarded: as you can see from line 19 of stdout.log, 10.1 Gb of coverage in reads shorter than 10 Kb was discarded, with the result that the assembly used only 0.8 Gb of coverage in reads longer than 10 Kb (see line 23 of stdout.log). So the first thing to do is decrease the read length cutoff. You can do this via option --Reads.minReadLength. For example, to reduce the read length cutoff to 5 Kb add the following to your command line:

--Reads.minReadLength 5000

To find a reasonable cutoff value you can use the information in Binned-ReadLengthHistogram.csv. Choose a value that keeps about 80% of your 11 Gb of coverage. Hopefully this will result in a read length cutoff that is not too low.
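If it is more convenient, the same number can be estimated directly from the reads instead of the histogram. A rough sketch, using the FASTQ path from your command (adjust as needed):

```bash
# Write all read lengths, sorted ascending, to a temporary file.
awk 'NR % 4 == 2 { print length($0) }' FAY11728_pass.fastq | sort -n > read_lengths.txt

# Find the read length below which ~20% of the total bases lie; a cutoff just
# under this value keeps roughly 80% of the coverage.
awk 'NR == FNR { total += $1; next }
     { discarded += $1 }
     discarded / total > 0.20 { print "approximate cutoff:", $1; exit }' \
    read_lengths.txt read_lengths.txt
```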

Given that your reads are so short I would also override the minimum length of an alignment by also adding the following to the command line:

--Align.minAlignedMarkerCount 100

Do you know why your reads are so short? The ULK114 reads described by ONT in May 2024 have a read N50 around 100 Kb, and even without an Ultra-Long (UL) protocol you should be able to easily get 30 to 50 Kb read N50. Working with these short reads will be a serious handicap for the assembly process even if we are able to optimize the assembly configuration.

mfonseca91 commented 10 hours ago

> There are several things happening here. Let's discuss them one at a time.
>
> It appears that your assembly process was killed while still running. Is it possible that this happened because of a memory issue or other system limit issue? Did you see a "Killed" message on the process output? That message would not make it to stdout.log. How much memory does the machine you are using have? Please post the output of the following commands:
>
> head -1 /proc/meminfo
> ulimit -a
>
> By the time the process was killed some other files should have been generated. Can you post here a list of the files that were created?
>
> Your reads are very short, and the 10 Kb read length cutoff used by that assembly configuration resulted in most of your coverage being discarded: as you can see from line 19 of stdout.log, 10.1 Gb of coverage in reads shorter than 10 Kb was discarded, with the result that the assembly used only 0.8 Gb of coverage in reads longer than 10 Kb (see line 23 of stdout.log). So the first thing to do is decrease the read length cutoff. You can do this via option --Reads.minReadLength. For example, to reduce the read length cutoff to 5 Kb add the following to your command line:
>
> --Reads.minReadLength 5000
>
> To find a reasonable cutoff value you can use the information in Binned-ReadLengthHistogram.csv. Choose a value that keeps about 80% of your 11 Gb of coverage. Hopefully this will result in a read length cutoff that is not too low.
>
> Given that your reads are so short I would also override the minimum length of an alignment by also adding the following to the command line:
>
> --Align.minAlignedMarkerCount 100
>
> Do you know why your reads are so short? The ULK114 reads described by ONT in May 2024 have a read N50 around 100 Kb, and even without an Ultra-Long (UL) protocol you should be able to easily get 30 to 50 Kb read N50. Working with these short reads will be a serious handicap for the assembly process even if we are able to optimize the assembly configuration.

A “killed” message was not generated.

About the machine memory:

head -1 /proc/meminfo
MemTotal:       1056443908 kB

ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 4126359
max locked memory       (kbytes, -l) 65536
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 4126359
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

List of files created:

Binned-ReadLengthHistogram.csv
DuplicateReads.csv
KmerFrequencyHistogram.csv
LowHashBucketHistogram.csv
performance.log
ReadLengthHistogram.csv
ReadLowHashStatistics.csv
shasta.conf
stdout.log
SuppressedAlignmentCandidates.csv

As for why we have short reads, we suspect it is related to the library preparation: most of the short fragments were not successfully eliminated.

paoloshasta commented 10 hours ago

OK, your machine configuration looks fine, but we have no explanation for why the assembly was interrupted. Is it possible that you are using a batch system that requires you to specify a time limit, or that imposes other resource limits? Are you running interactively or under a job submission system?

In any event, try running again, this time using my above suggestions:

--config Nanopore-r10.4.1_e8.2-400bps_sup-Raw-Sep2024 --Reads.minReadLength 5000 --Align.minAlignedMarkerCount 100
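For reference, a sketch of the full command line with these options folded into the one you used earlier (same paths, memory mode, and thread count; adjust as needed). If I remember correctly Shasta refuses to reuse an existing assembly directory, so run it from a fresh directory or point --assemblyDirectory at a new location:

```bash
# ShastaRun2 is an arbitrary example name for a new assembly directory.
/home/mfonseca/LGC_INMEGEN/Proyecto_chinches/Shasta/shasta-Linux-0.13.0 \
    --input /home/mfonseca/LGC_INMEGEN/Proyecto_chinches/ONT_data_chinches/FAY11728_pass.fastq \
    --config Nanopore-r10.4.1_e8.2-400bps_sup-Raw-Sep2024 \
    --Reads.minReadLength 5000 \
    --Align.minAlignedMarkerCount 100 \
    --memoryMode anonymous \
    --threads 50 \
    --assemblyDirectory ShastaRun2
```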

In addition to stdout.log, please also post the following for the new assembly: