mvolar / SatXplor

A satDNA exploration pipeline
MIT License
3 stars 0 forks source link

Error running on custom data #3

Closed JosemaRico closed 3 months ago

JosemaRico commented 3 months ago

Hello, I have already run the test and it works correctly.

Now, testing with some of my own data, it starts to run and gives me this error on one of the satellites (the most abundant according to quantifications):

WARNING - Error creating arrays for Sat01 error message: zero-size array to reduction operation maximum which has no identity

Also, for other satellites it gives me the following error: WARNING - No peaks were found for Sat14, setting default extension factor, reported error attempt to get argmax of an empty sequence

Is it normal for these errors to appear?

The program continues running with apparent normality until:

2024-08-19 14:41:32,575 - INFO - Alignment completed for ./results/data/sequences/Sat29_monomers.fasta. Output saved to ./results/data/sequences/Sat29_monomers_aligned.fasta Error running MAFFT for ./results/data/sequences/Sat01_monomers.fasta. Error message: Command '['mafft', '--adjustdirection', '--reorder', '--quiet', '--thread', '6', './results/data/sequences/Sat01_monomers.fasta']' returned non-zero exit status 1. 2024-08-19 14:41:32,635 - WARNING - Error in mafft for Sat01_monomers.fasta, error message: Command '['mafft', '--adjustdirection', '--reorder', '--quiet', '--thread', '6', './results/data/sequences/Sat01_monomers.fasta']' returned non-zero exit status 1. 2024-08-19 14:41:32,758 - INFO - Alignment completed for ./results/data/sequences/Sat04_monomers.fasta. Output saved to ./results/data/sequences/Sat04_monomers_aligned.fasta 2024-08-19 14:41:35,979 - INFO - Alignment completed for ./results/data/sequences/Sat06_monomers.fasta. Output saved to ./results/data/sequences/Sat06_monomers_aligned.fasta Error running MAFFT for ./results/data/sequences/Sat14_monomers.fasta. Error message: Command '['mafft', '--adjustdirection', '--reorder', '--quiet', '--thread', '6', './results/data/sequences/Sat14_monomers.fasta']' returned non-zero exit status 1. 2024-08-19 14:41:36,042 - WARNING - Error in mafft for Sat14_monomers.fasta, error message: Command '['mafft', '--adjustdirection', '--reorder', '--quiet', '--thread', '6', './results/data/sequences/Sat14_monomers.fasta']' returned non-zero exit status 1.

The program continues running with apparent normality until:

2024-08-19 14:42:22,995 - INFO - Running PCA umap for ./results/data/sequences/Sat28_monomers_aligned.fasta --- Logging error --- Traceback (most recent call last): File "/root/micromamba/envs/myenv/lib/python3.10/logging/init.py", line 1100, in emit msg = self.format(record) File "/root/micromamba/envs/myenv/lib/python3.10/logging/init.py", line 943, in format return fmt.format(record) File "/root/micromamba/envs/myenv/lib/python3.10/logging/init.py", line 678, in format record.message = record.getMessage() File "/root/micromamba/envs/myenv/lib/python3.10/logging/init.py", line 368, in getMessage msg = msg % self.args TypeError: not all arguments converted during string formatting Call stack: File "/root/app/satxplor/controller.py", line 298, in main() File "/root/app/satxplor/controller.py", line 274, in main fun_list[i]() File "/root/app/satxplor/controller.py", line 219, in run_pca executor.map(run_pca_script, matching_files) File "/root/micromamba/envs/myenv/lib/python3.10/concurrent/futures/process.py", line 766, in map results = super().map(partial(_process_chunk, fn), File "/root/micromamba/envs/myenv/lib/python3.10/concurrent/futures/_base.py", line 610, in map fs = [self.submit(fn, args) for args in zip(iterables)] File "/root/micromamba/envs/myenv/lib/python3.10/concurrent/futures/_base.py", line 610, in fs = [self.submit(fn, args) for args in zip(iterables)] File "/root/micromamba/envs/myenv/lib/python3.10/concurrent/futures/process.py", line 738, in submit self._start_executor_manager_thread() File "/root/micromamba/envs/myenv/lib/python3.10/concurrent/futures/process.py", line 678, in _start_executor_manager_thread self._launch_processes() File "/root/micromamba/envs/myenv/lib/python3.10/concurrent/futures/process.py", line 705, in _launch_processes self._spawn_process() File "/root/micromamba/envs/myenv/lib/python3.10/concurrent/futures/process.py", line 714, in _spawn_process p.start() File "/root/micromamba/envs/myenv/lib/python3.10/multiprocessing/process.py", line 121, in start self._popen = self._Popen(self) File "/root/micromamba/envs/myenv/lib/python3.10/multiprocessing/context.py", line 281, in _Popen return Popen(process_obj) File "/root/micromamba/envs/myenv/lib/python3.10/multiprocessing/popen_fork.py", line 19, in init self._launch(process_obj) File "/root/micromamba/envs/myenv/lib/python3.10/multiprocessing/popen_fork.py", line 71, in _launch code = process_obj._bootstrap(parent_sentinel=child_r) File "/root/micromamba/envs/myenv/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap self.run() File "/root/micromamba/envs/myenv/lib/python3.10/multiprocessing/process.py", line 108, in run self._target(*self._args, *self._kwargs) File "/root/micromamba/envs/myenv/lib/python3.10/concurrent/futures/process.py", line 246, in _process_worker r = call_item.fn(call_item.args, *call_item.kwargs) File "/root/micromamba/envs/myenv/lib/python3.10/concurrent/futures/process.py", line 205, in _process_chunk return [fn(args) for args in chunk] File "/root/micromamba/envs/myenv/lib/python3.10/concurrent/futures/process.py", line 205, in return [fn(*args) for args in chunk] File "/root/app/satxplor/r_script_runners.py", line 18, in run_pca_script logger.warning(f"Error executing R script for {alignment}. Return code:", result.returncode) Message: 'Error executing R script for ./results/data/sequences/Sat04_monomers_aligned.fasta. Return code:' Arguments: (1,)

It keeps running with errors for some of the satellites and the last lines are:

2024-08-19 14:55:36,431 - INFO - R script executed successfully for ./results/data/sequences/flanks/Sat17_microhomology_aligned.fasta. 2024-08-19 14:55:36,431 - INFO - Running finished at 2024-08-19 14:55:36.431202, deleting the checkpoints file. Copying results file to output directory /mnt/data/output_folder

mvolar commented 3 months ago

SatXplor is designed to allow errors on some satellites and continue on others if it finds some unexpected behaviours.

Are you using an chromosome level assembly or contig level? Also which program you used to get the satellite sequences?

Number one error is low quantitiy of monomers in the assembly or non true satellite sequences (i.e. the program failed to find any true arrays in the provided assembly)

Would you mind sending the results of the output directory (without the assembly or the satellites) so i can see if I can help you in some capacity.

Also I presume you are running the docker version of the app? Since the app is still in its early stages of development, there is a possibility we failed to anticipate some behaviours, even though we tested it on 4 different genomes and satellitomes.

JosemaRico commented 3 months ago

I'm using a chromosome-level genome assembly (downloaded from NCBI in .fna format and converted to .fasta format). The satellite sequences were identified using RepeatExplorer2/TAREAN, but not from the reads of the genome assembly. Instead, I used our own sequencing data (Illumina short-read sequences). When I ran RepeatMasker or RepeatProfiler, all the sequences in our reads were identified as satellite sequences. I´m using the docker version.

I attach the output files. I noticed that in the sequences folder within the data folder, for Sat01 (for example), there are two files: Sat01-575_monomer_dimers.fasta and Sat01-575_monomers.fasta. However, for Sat02 (and others), in addition to these two files, there is a third one: Sat02-1132_monomers_aligned.fasta. Interestingly, the blast_output.tsv table shows that Sat01 is found with Blast, despite the absence of an aligned file. In adition, Additionally, I have tried running a BLAST search on their official website (https://blast.ncbi.nlm.nih.gov/Blast.cgi) directly on the assembled genome that I am using, and it does indeed find monomers of Sat01. Don't worry, I understand that the program is very new. I’m mostly testing it and also want to help you resolve some issues. output_folder1.zip output_folder2.zip

Thanks you!

mvolar commented 3 months ago

Thank you for being an early adopter and a "beta-tester"!

From what I see from the data, I think your satellites are very divergent, i.e. Sat01 is probably a trimer of the same subunit, Sat02 is rather large for a satellite and forms short tandems.

I would suggest editing the satxplor/utils/constants.py file (you can use vim/nano in the interactive shell within the docker container) where many of the filtering/processing cosntants are defined, and seem to filter out much of the found blast hits:

#plot constants
MAX_PLOT_LEN = 5000 #leave this
HISTOGRAM_BIN_WIDTH = 50 #you can also leave this
NKERNEL_BINS = 100 #you can also leave this
FLANK_SIZE=500 #if your monomers are on the edges of chromosomes, put this to 0
ARRAY_HOR_PERC=0.01 #you can put this to 0.001
MONOMER_NUMBER=4 #MOST important, put this to 1 or 0.1
CONTIG_FILTER=100000 
DIMENSION_RED_MODE="both"
PERC_ID_FILTER=90.0 #ALSO IMPORTANT for Sat01 i would put this to 50.0 
QCOVHSP_FILTER=90.0 #and this to 20.0
SQUISH=True
SQUISH_LEN=2500
JosemaRico commented 3 months ago

I modified constants.py to:

plot constants

MAX_PLOT_LEN = 5000 HISTOGRAM_BIN_WIDTH = 50 NKERNEL_BINS = 100 FLANK_SIZE=500 ARRAY_HOR_PERC=0.001 MONOMER_NUMBER=1 CONTIG_FILTER=100000 DIMENSION_RED_MODE="both" PERC_ID_FILTER=50.0 QCOVHSP_FILTER=20.0 SQUISH=True SQUISH_LEN=2500

Can you explain to me what I am changing? From a biological standpoint, what would be the most correct approach? Shouldn't PERC_ID_FILTER be set to 70% if we consider two satellites to be different when they have less than 70% similarity? Or does this have nothing to do with it?

Now, these errors do not appear:

WARNING - Error creating arrays for Sat01 error message: zero-size array to reduction operation maximum which has no identity

Also, for other satellites it gives me the following error: WARNING - No peaks were found for Sat14, setting default extension factor, reported error attempt to get argmax of an empty sequence

Now it gives me other errors that I believe come from the code in r_script_runners.py (I think there are formatting errors in the code). These errors are related to a formatting issue with the logging function in Python, specifically when trying to log a message. The logger.warning function is receiving a formatted string with f"..." and then a second argument result.returncode. However, the warning function expects a single message, and additional arguments should be used if the old % syntax is being used. The solution is to combine alignment and result.returncode into a single string before passing it to logger.warning:

In satxplor/r_script_runners.py On lines 18 and 19:

            logger.warning(f"Error executing R script for {alignment}. Return code:", result.returncode)
            logger.warning("STDERR:\n", result.stderr)

I think it should be changed to:

            logger.warning(f"Error executing R script for {alignment}. Return code: {result.returncode}")
            logger.warning(f"STDERR:\n{result.stderr}")

Likewise, in lines 37 and 38 of this same script:

            logger.info(f"Error executing R script for {distance_matrix}. Return code:", result.returncode)
            logger.info("STDERR:\n", result.stderr)

I think it should be changed to:

            logger.info(f"Error executing R script for {distance_matrix}. Return code: {result.returncode}")
            logger.info(f"STDERR:\n{result.stderr}")

Likewise, in lines 71 and 72 of this same script:

            logger.info(f"Error executing R script for {mhl}. Return code:", result.returncode)
            logger.info("STDERR:\n", result.stderr)

I think it should be changed to:

            logger.info(f"Error executing R script for {mhl}. Return code: {result.returncode}")
            logger.info(f"STDERR:\n{result.stderr}")

When I change these lines, I don't see any formatting errors in the code, but rather an error when running the script with a specific satellite, for example, for the Sat14, I see this error:

2024-08-21 13:16:49,153 - WARNING - Error executing R script for ./results/data/sequences/Sat14-791_monomers_aligned.fasta. Return code: 1
2024-08-21 13:16:49,153 - WARNING - STDERR:
Saving 7 x 7 in image
Saving 7 x 7 in image
Error: umap: number of neighbors must be smaller than number of items
In addition: Warning messages:
1: Removed 5 rows containing missing values or values outside the scale range
(`geom_bar()`).
2: Removed 5 rows containing missing values or values outside the scale range
(`geom_text()`).
Execution halted

Also, i see the next error for some satellites (Sat22, Sat28, Sat11, Sat19, Sat14)

2024-08-21 13:33:48,649 - INFO - Error executing R script for ./results/pictures/dimensions/Sat22-253_aligned_matrix.csv.gz. Return code: 1
2024-08-21 13:33:48,649 - INFO - STDERR:
Warning message:
In melt.data.table(matrix) :
  id.vars and measure.vars are internally guessed when both are 'NULL'. All non-numeric/integer/logical type columns are considered id.vars, which in this case are columns [names, ...]. Consider providing at least one of 'id' or 'measure' vars in future.
Warning message:
In min(mval, na.rm = TRUE) : no non-missing arguments to min; returning Inf
Error in data.frame(LinksDF, colour = linkColour) :
  arguments imply differing number of rows: 0, 1
Calls: forceNetwork -> data.frame
In addition: Warning message:
It looks like Source/Target is not zero-indexed. This is required in JavaScript and so your plot may not render.
Execution halted

After these errors, the program ends without further errors If you need, I can send you the output folder by email since it is too large to upload here

Thanks you.

mvolar commented 3 months ago
            logger.warning(f"Error executing R script for {alignment}. Return code: {result.returncode}")
            logger.warning(f"STDERR:\n{result.stderr}")

I will fix these errors first on the github source and then on the docker version of SatXplor then push them, this will probably be done tommorow, but i'll let you know.

WARNING - No peaks were found for Sat14, setting default extension factor, reported error attempt to get argmax of an empty sequence

This warning signifies that for Sat14 there exist no regular periodic gaps within the tandem, there are a couple of visualizations in the paper (under citation) which highlight common behaviors we noticed in satellite organization.

2024-08-21 13:16:49,153 - WARNING - Error executing R script for ./results/data/sequences/Sat14-791_monomers_aligned.fasta. Return code: 1
2024-08-21 13:16:49,153 - WARNING - STDERR:
Saving 7 x 7 in image
Saving 7 x 7 in image
Error: umap: number of neighbors must be smaller than number of items
In addition: Warning messages:
1: Removed 5 rows containing missing values or values outside the scale range
(`geom_bar()`).
2: Removed 5 rows containing missing values or values outside the scale range
(`geom_text()`).
Execution halted

These last errors are more like soft errors which happend for some satellites when processing. Design choice was made because it is hard to predict perfect code execution for each satellite ever. The error stems when the data (in this case alignments) cannot be meaningfully seperated by PCA/UMAP/graph networks. This is most likely due to the nature of the satellite sequence in the genome i.e. the entire satellite is 10-100 identical copies, or satellite sequence is only present on one chromosome in a array<5 etc etc.

To look into this you would need to open the annotations on the genome on the IGV or geneious and see why this happens, as there is no general rule of thumb nor a way to guarantee perfect code execution for each satellite ever.

As for the constants I will add documentation on the subject to better explain the constants parameters in a couple of days, but briefly:

#plot constants
MAX_PLOT_LEN = 5000
HISTOGRAM_BIN_WIDTH = 50
NKERNEL_BINS = 100
FLANK_SIZE=500 -> when we do flank analysis how far away do we look
ARRAY_HOR_PERC=0.001 -> see paper on HOR discussion, basically if a certain number of monomers are found with periodic breaks we want to link them together and get that region out
MONOMER_NUMBER=1 -> number of monomers to declare an array
CONTIG_FILTER=100000 -> if we use low quality assemblies (contig level), usually <100kb is trash filled with centromeres so we want to filter this
DIMENSION_RED_MODE="both" -> are we using PCA or UMAP
PERC_ID_FILTER=50.0 -> percentage identity, usually we used 70% perc_id, 70% qcovhsp, however these parameters are VERY prone to change due to the divergent nature of satDNA
QCOVHSP_FILTER=20.0 -> query coverage percentage (akin to alignment length)
SQUISH=True -> Setting if we want to perform squishing (see below)
SQUISH_LEN=2500 -> This is for edge detection, first 2.5kb and last 2.5kb or array are used (or the entire array if len<5kb). This is done for speed.
JosemaRico commented 3 months ago

Thank you very much, I close the issue