transcript / samsa2

SAMSA pipeline, version 2.0. An open-source metatranscriptomics pipeline for analyzing microbiome data, built around DIAMOND and customizable reference databases.
GNU General Public License v3.0

error while running run_DESeq_stats.R #81

Closed rachel1898 closed 5 months ago

rachel1898 commented 11 months ago

Hello,

I got an error when running DESeq_stats through master_script.sh. Steps 1 to 5 went perfectly, but at step 6 I got the following error:

[1] "USAGE: $ run_DESeq_stats.R -I working_directory/ -O save.filename"
Working directory is /home/samsa2/output/step_5_output/RefSeq_results/org_results
Error in match.names(clabs, names(xi)) :
  names do not match previous names
Calls: rbind ... eval -> eval -> eval -> rbind -> rbind -> match.names
In addition: Warning message:
NAs introduced by coercion
Execution halted
'Rscript /home/samsa2/R_scripts/run_DESeq_stats.R -I /home/samsa2/output/step_5_output/RefSeq_results/org_results -O RefSeq_org_DESeq_results.tab -R /home//samsa2/output/step_2_output/raw_counts.txt' exited with non-zero status 1

I think the problem is that the control files don't have the same column names, but since I ran everything through master_script.sh, I could not figure out what went wrong. Could you guide me through this?

Thanks!! Raquel

lisakmalins commented 7 months ago

Hi @rachel1898, I got the same error message when I was running SAMSA2 and I was able to fix it so I might be able to help. Did you make sure your filenames all start with control_ and experimental_ before starting the pipeline? Could you post a picture of your step_2_output/raw_counts.txt file?
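
The prefix check suggested here can be done before starting the pipeline. The following is a minimal sketch, not part of SAMSA2: check_prefixes is a hypothetical helper, and it assumes all input files sit in one directory whose path is passed as the first argument.

```shell
# Hypothetical helper (not part of SAMSA2): list files in a directory whose
# names do not start with control_ or experimental_.
# Returns 0 if every file has an expected prefix, 1 otherwise.
check_prefixes() {
    local dir="$1" bad=0 f name
    for f in "$dir"/*; do
        [ -e "$f" ] || continue            # skip when the glob matches nothing
        name=$(basename "$f")
        case "$name" in
            control_*|experimental_*) ;;   # expected prefix, nothing to report
            *) echo "unexpected prefix: $name"; bad=1 ;;
        esac
    done
    return "$bad"
}
```

After sourcing the function, it could be called with the path to the input directory, for example `check_prefixes input_files`. Any file it reports would need renaming first.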

lisakmalins commented 7 months ago

Question for the developers: Was SAMSA2 written expecting the sample names to be strictly numeric? I ran into two errors in run_DESeq_stats.R which I was able to fix by modifying the script. First I got the same error that @rachel1898 posted above because my sample names were not numeric, so this operation (line 134) replaced them all with NA:

raw_counts_table$X2 <- as.numeric(as.character(raw_counts_table$X2))

So then when those values got used as column names, it made rbind (line 145) throw that error because it needs the column names to be the same between the two dataframes.

Later in the script I had another issue with my count values being turned into factors and introducing more NAs, which I think also might have happened because of my sample names not being numeric. I was able to fix the problem by modifying a few lines of the script, but I was wondering if maybe the real issue was that I misunderstood the usage instructions and should have changed my filenames before running the pipeline to avoid any problems.

Can you provide some guidance on what characters are allowed to be in the input filenames?

transcript commented 6 months ago

Hello,

No, SAMSA2 wasn't written to explicitly expect numeric samples (although I believe that whitespace characters can sometimes mess things up).

I suspect what's happening is that, if the samples are too different, the rbind command leads to a bunch of additional rows because it can't find any common rows to merge on.

One useful check: look at the head of a couple of your files and see if they match the example files in https://github.com/transcript/samsa2/blob/master/sample_files_paired-end/6_RefSeq_org_results/control_1_TINY.RefSeq_annot_organism.tsv or similar.

Another option is to run this in RStudio and see if any of the intermediate tables look invalid.

Finally, if none of this is able to yield results or if you want me to look at one or two of the input files to see if I spot any inconsistencies, you could drop me an email (swestreich@gmail.com) with one or two attached.

Sorry to hear you're having issues with my pipeline, and I hope I can resolve them!

lisakmalins commented 5 months ago

Hi @transcript, thank you for the thoughtful response!

I have a reproducible example of the behavior that @rachel1898 and I observed:

git clone https://github.com/transcript/samsa2.git
cd samsa2
# Use the sample files, but remove an underscore
# control_1_TINY_R1.fastq --> control_1TINY_R1.fastq
cp -r sample_files_paired-end/1_starting_files input_files
for f in input_files/*; do mv "$f" "${f/_TINY/TINY}"; done
bash setup_and_test/package_installation.bash
bash setup_and_test/full_database_download.bash
bash bash_scripts/master_script.sh
# run_DESeq_stats.R fails

Error message:

[1] "USAGE: $ run_DESeq_stats.R -I working_directory/ -O save.filename"
Working directory is  /redacted/path/to/folder/samsa2/output_files/step_5_output/RefSeq_results/org_results
Error in match.names(clabs, names(xi)) :
  names do not match previous names
Calls: rbind ... eval -> eval -> eval -> rbind -> rbind -> match.names
In addition: Warning message:
NAs introduced by coercion
Execution halted
'Rscript /redacted/path/to/folder/samsa2/R_scripts/run_DESeq_stats.R -I /redacted/path/to/folder/samsa2/output_files/step_5_output/RefSeq_results/org_results -O RefSeq_org_DESeq_results.tab -R /redacted/path/to/folder/samsa2/output_files/step_2_output/raw_counts.txt' exited with non-zero status 1

The problem is that in order to parse the information out of the filenames, run_DESeq_stats.R splits them by underscore into fields, expects the second field to be numeric, and after transposing uses it as column names for rbind.

# Using example data without messing with filenames
                                   V1   V2           X1 X2                   X3
1      control_1_TINY.cleaned.forward 2719      control  1 TINY.cleaned.forward
2      control_2_TINY.cleaned.forward 2695      control  2 TINY.cleaned.forward
3 experimental_3_TINY.cleaned.forward 2682 experimental  3 TINY.cleaned.forward
4 experimental_4_TINY.cleaned.forward 2684 experimental  4 TINY.cleaned.forward

However, if the filenames do not follow that pattern and the second field is not numeric, NAs are introduced:

# Example with second underscore removed from filenames
                                  V1   V2
1      control_1TINY.cleaned.forward 2719
2      control_2TINY.cleaned.forward 2695
3 experimental_3TINY.cleaned.forward 2682
4 experimental_4TINY.cleaned.forward 2684

# Split on underscore
                                  V1   V2           X1                    X2
1      control_1TINY.cleaned.forward 2719      control 1TINY.cleaned.forward
2      control_2TINY.cleaned.forward 2695      control 2TINY.cleaned.forward
3 experimental_3TINY.cleaned.forward 2682 experimental 3TINY.cleaned.forward
4 experimental_4TINY.cleaned.forward 2684 experimental 4TINY.cleaned.forward

# Coercing column X2 to numeric induces NA
# and prevents rbind with complete_table dataframe
                                  V1   V2           X1 X2
1      control_1TINY.cleaned.forward 2719      control NA
2      control_2TINY.cleaned.forward 2695      control NA
3 experimental_3TINY.cleaned.forward 2682 experimental NA
4 experimental_4TINY.cleaned.forward 2684 experimental NA
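
The parsing rule illustrated above can be tested on sample names before running the pipeline. This is a rough sketch under stated assumptions: second_field_is_numeric is a hypothetical helper, not part of SAMSA2, and it accepts only plain digits, which is stricter than R's as.numeric (that also accepts values such as 1.5).

```shell
# Hypothetical helper (not part of SAMSA2): succeed only if the second
# underscore-separated field of a sample name consists solely of digits.
second_field_is_numeric() {
    local name="$1" field
    # cut prints the whole name when it contains no underscore at all,
    # so such names fall through to the non-numeric branch below.
    field=$(printf '%s\n' "$name" | cut -d_ -f2)
    case "$field" in
        ''|*[!0-9]*) return 1 ;;   # empty, or contains a non-digit character
        *) return 0 ;;             # digits only
    esac
}
```

With the names shown above, control_1_TINY.cleaned.forward passes because its second field is 1. control_1TINY.cleaned.forward fails because its second field is 1TINY.cleaned.forward.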

I have a fix that allows more flexibility in the filenames. Are you open to pull requests?

transcript commented 5 months ago

Fix by @lisakmalins added!