Closed (rachel1898 closed this issue 5 months ago)
Hi @rachel1898, I got the same error message when I was running SAMSA2 and was able to fix it, so I might be able to help. Did you make sure your filenames all start with either control_ or experimental_ before starting the pipeline? Could you post a screenshot of your step_2_output/raw_counts.txt file?
Question for the developers: Was SAMSA2 written expecting the sample names to be strictly numeric? I ran into two errors in run_DESeq_stats.R, which I was able to fix by modifying the script. First, I got the same error that @rachel1898 posted above because my sample names were not numeric, so this operation (line 134) replaced them all with NA:
raw_counts_table$X2 <- as.numeric(as.character(raw_counts_table$X2))
When those values were then used as column names, rbind (line 145) threw that error because it needs the column names to be the same between the two dataframes.
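To illustrate the coercion described above outside of R, here is a minimal shell sketch. The helper name sample_number is hypothetical and not part of SAMSA2, and the numeric test only approximates R's as.numeric() (it accepts plain integers and decimals). It takes the second underscore-delimited field of a sample name and prints NA when that field is not a number.

```shell
# Hypothetical helper, not part of SAMSA2: imitate the "second field must be
# numeric" behaviour discussed in this thread.
sample_number() (
    # Second underscore-delimited field of the sample name.
    field=$(printf '%s\n' "$1" | cut -d '_' -f 2)
    # Rough stand-in for as.numeric(): plain integers or decimals only.
    if printf '%s\n' "$field" | grep -Eq '^[0-9]+(\.[0-9]+)?$'; then
        printf '%s\n' "$field"
    else
        printf 'NA\n'
    fi
)

sample_number "control_1_TINY.cleaned.forward"   # prints 1
sample_number "control_1TINY.cleaned.forward"    # prints NA
```

With the second underscore missing, the whole remainder of the name ends up in the second field, so it can no longer be read as a number.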
Later in the script I had another issue with my count values being turned into factors and introducing more NAs, which I think also might have happened because of my sample names not being numeric. I was able to fix the problem by modifying a few lines of the script, but I was wondering if maybe the real issue was that I misunderstood the usage instructions and should have changed my filenames before running the pipeline to avoid any problems.
Can you provide some guidance on what characters are allowed to be in the input filenames?
Hello,
No, SAMSA2 wasn't written to explicitly expect numeric samples (although I believe that whitespace characters can sometimes mess things up).
I suspect what's happening is that, if the samples are too different, the rbind command leads to a bunch of additional rows because it can't find any common rows to merge on.
One useful check: look at the head of a couple of your files and see if they match the example files in https://github.com/transcript/samsa2/blob/master/sample_files_paired-end/6_RefSeq_org_results/control_1_TINY.RefSeq_annot_organism.tsv or similar.
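A quick way to do that comparison from the command line might look like the sketch below. The function name tsv_shape is hypothetical, and it assumes the annotation results are tab-separated, as the .tsv extension suggests. It prints the number of lines and the number of tab-separated fields on the first line, so the same two numbers can be compared between your file and an example file.

```shell
# Hypothetical helper, not part of SAMSA2: print "<line count> <fields on line 1>"
# for a tab-separated file.
tsv_shape() {
    awk -F '\t' 'NR == 1 { cols = NF } END { print NR, cols + 0 }' "$1"
}

# Example usage (paths are illustrative; adjust to your own output directory):
#   head -n 5 my_results/control_1.RefSeq_annot_organism.tsv
#   tsv_shape my_results/control_1.RefSeq_annot_organism.tsv
#   tsv_shape sample_files_paired-end/6_RefSeq_org_results/control_1_TINY.RefSeq_annot_organism.tsv
```

If the field counts differ between your file and the example, that is a good place to start looking.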
Another option is to run this in RStudio and see if any of the intermediate tables look invalid.
Finally, if none of this is able to yield results or if you want me to look at one or two of the input files to see if I spot any inconsistencies, you could drop me an email (swestreich@gmail.com) with one or two attached.
Sorry to hear you're having issues with my pipeline, and I hope I can resolve them!
Hi @transcript, thank you for the thoughtful response!
I have a reproducible example of the behavior that @rachel1898 and I observed:
git clone https://github.com/transcript/samsa2.git
cd samsa2
# Use the sample files, but remove an underscore
# control_1_TINY_R1.fastq --> control_1TINY_R1.fastq
cp -r sample_files_paired-end/1_starting_files input_files
for f in input_files/*; do mv "$f" "${f/_TINY/TINY}"; done
bash setup_and_test/package_installation.bash
bash setup_and_test/full_database_download.bash
bash bash_scripts/master_script.sh
# run_DESeq_stats.R fails
Error message:
[1] "USAGE: $ run_DESeq_stats.R -I working_directory/ -O save.filename"
Working directory is /redacted/path/to/folder/samsa2/output_files/step_5_output/RefSeq_results/org_results
Error in match.names(clabs, names(xi)) :
names do not match previous names
Calls: rbind ... eval -> eval -> eval -> rbind -> rbind -> match.names
In addition: Warning message:
NAs introduced by coercion
Execution halted
'Rscript /redacted/path/to/folder/samsa2/R_scripts/run_DESeq_stats.R -I /redacted/path/to/folder/samsa2/output_files/step_5_output/RefSeq_results/org_results -O RefSeq_org_DESeq_results.tab -R /redacted/path/to/folder/samsa2/output_files/step_2_output/raw_counts.txt' exited with non-zero status 1
The problem is that, in order to parse the information out of the filenames, run_DESeq_stats.R splits them by underscore into fields, expects the second field to be numeric, and after transposing uses it as column names for rbind.
# Using example data without messing with filenames
V1 V2 X1 X2 X3
1 control_1_TINY.cleaned.forward 2719 control 1 TINY.cleaned.forward
2 control_2_TINY.cleaned.forward 2695 control 2 TINY.cleaned.forward
3 experimental_3_TINY.cleaned.forward 2682 experimental 3 TINY.cleaned.forward
4 experimental_4_TINY.cleaned.forward 2684 experimental 4 TINY.cleaned.forward
However, if the filenames do not follow that pattern and the second field is not numeric, NAs are introduced:
# Example with second underscore removed from filenames
V1 V2
1 control_1TINY.cleaned.forward 2719
2 control_2TINY.cleaned.forward 2695
3 experimental_3TINY.cleaned.forward 2682
4 experimental_4TINY.cleaned.forward 2684
# Split on underscore
V1 V2 X1 X2
1 control_1TINY.cleaned.forward 2719 control 1TINY.cleaned.forward
2 control_2TINY.cleaned.forward 2695 control 2TINY.cleaned.forward
3 experimental_3TINY.cleaned.forward 2682 experimental 3TINY.cleaned.forward
4 experimental_4TINY.cleaned.forward 2684 experimental 4TINY.cleaned.forward
# Coercing column X2 to numeric induces NA
# and prevents rbind with complete_table dataframe
V1 V2 X1 X2
1 control_1TINY.cleaned.forward 2719 control NA
2 control_2TINY.cleaned.forward 2695 control NA
3 experimental_3TINY.cleaned.forward 2682 experimental NA
4 experimental_4TINY.cleaned.forward 2684 experimental NA
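Until filenames are handled more flexibly, a pre-flight check along these lines could flag problem samples before the DESeq step runs. This is only a sketch: the function name nonnumeric_samples is hypothetical, and it assumes each line of the counts file starts with the sample name followed by whitespace and the count, as in the V1 and V2 columns shown above.

```shell
# Hypothetical pre-flight check, not part of SAMSA2: list sample names whose
# second underscore-delimited field is not a whole number.
nonnumeric_samples() {
    awk '{
        n = split($1, parts, "_")
        if (n < 2 || parts[2] !~ /^[0-9]+$/) print $1
    }' "$1"
}

# Example usage (path is illustrative):
#   nonnumeric_samples step_2_output/raw_counts.txt
```

No output means every sample name has a numeric second field; any name that is printed would be turned into NA by the coercion step.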
I have a fix that allows more flexibility in the filenames. Are you open to pull requests?
Fix by @lisakmalins added!
Hello,
I got an error when I tried to run DESeq_stats through master_script.sh. Steps 1 to 5 went perfectly, but when it got to step 6 I got the following error:
[1] "USAGE: $ run_DESeq_stats.R -I working_directory/ -O save.filename"
Working directory is /home/samsa2/output/step_5_output/RefSeq_results/org_results
Error in match.names(clabs, names(xi)) :
names do not match previous names
Calls: rbind ... eval -> eval -> eval -> rbind -> rbind -> match.names
In addition: Warning message:
NAs introduced by coercion
Execution halted
'Rscript /home/samsa2/R_scripts/run_DESeq_stats.R -I /home/samsa2/output/step_5_output/RefSeq_results/org_results -O RefSeq_org_DESeq_results.tab -R /home//samsa2/output/step_2_output/raw_counts.txt' exited with non-zero status 1
I think the problem is that the control files don't have the same column names, but since I ran everything through master_script.sh I could not figure out what went wrong. Could you guide me through this?
Thanks!! Raquel