Deseq2 colname matching error

albertdyu commented 1 year ago

Hi there,

Even without differential expression, this is a very nice and impressive tool you've developed! Thank you for your efforts!

I'm guessing I could just feed the DoG tables into DESeq2 myself, but for the read-through analyses, it certainly feels like it'd be better to do it your way.

Anyways, the problem arises when I try to run either of the diff_exp modes. I get this error:

ARTDeco -mode diff_exp_dogs -bam-files-dir ./BAMFiles
Running diff_exp_dogs mode...
Loading ARTDeco file structure...
Reformatted meta file exists...
Reformatted comparisons file exists...
ARTDeco will generate the following files:
./diff_exp_dogs/60uM_KL1_10HS-DMSO_NHS-results.txt
./diff_exp_dogs/20uM_KL1_10HS-60uM_KL1_10HS-results.txt
./diff_exp_dogs/20uM_KL1_NHS-DMSO_NHS-results.txt
./diff_exp_dogs/60uM_KL1_10HS-DMSO_10HS-results.txt
./diff_exp_dogs/20uM_KL1_NHS-DMSO_10HS-results.txt
./diff_exp_dogs/20uM_KL1_NHS-60uM_KL1_NHS-results.txt
./diff_exp_dogs/60uM_KL1_NHS-DMSO_NHS-results.txt
./diff_exp_dogs/20uM_KL1_NHS-60uM_KL1_10HS-results.txt
./diff_exp_dogs/20uM_KL1_10HS-DMSO_NHS-results.txt
./diff_exp_dogs/60uM_KL1_NHS-DMSO_10HS-results.txt
./diff_exp_dogs/DMSO_10HS-DMSO_NHS-results.txt
./diff_exp_dogs/20uM_KL1_10HS-DMSO_10HS-results.txt
./diff_exp_dogs/20uM_KL1_10HS-20uM_KL1_NHS-results.txt
./diff_exp_dogs/60uM_KL1_10HS-60uM_KL1_NHS-results.txt
./diff_exp_dogs/20uM_KL1_10HS-60uM_KL1_NHS-results.txt
Running DESeq2 on DoGs...
/home/adyu/.conda/envs/ARTDeco/lib/python3.6/site-packages/rpy2/rinterface/__init__.py:146: RRuntimeWarning: Error in .validate_names(colnames, ans_colnames, "assay colnames()", "colData rownames()") :
assay colnames() must be NULL or identical to colData rownames()

warnings.warn(x, RRuntimeWarning)

My meta.reformatted.txt looks like this:

Experiment            Group
20uM_10HS_Rep1_S9     20uM_KL1_10HS
20uM_10HS_Rep2_S10    20uM_KL1_10HS
20uM_NHS_Rep1_S3      20uM_KL1_NHS
20uM_NHS_Rep2_S4      20uM_KL1_NHS
60uM_10HS_Rep1_S11    60uM_KL1_10HS
60uM_10HS_Rep2_S12    60uM_KL1_10HS
60uM_NHS_Rep1_S5      60uM_KL1_NHS
60uM_NHS_Rep2_S6      60uM_KL1_NHS
DMSO_10HS_Rep1_S7     DMSO_10HS
DMSO_10HS_Rep2_S8     DMSO_10HS
DMSO_NHS_Rep1_S1      DMSO_NHS
DMSO_NHS_Rep2_S2      DMSO_NHS

and the top row of all_dogs.raw.txt, which I hope will suffice for an example, looks like this:

ID Length 20uM_10HS_Rep1_S9 20uM_10HS_Rep2_S10 20uM_NHS_Rep1_S3 20uM_NHS_Rep2_S4 60uM_10HS_Rep1_S11 60uM_10HS_Rep2_S12 60uM_NHS_Rep1_S5 60uM_NHS_Rep2_S6 DMSO_10HS_Rep1_S7 DMSO_10HS_Rep2_S8 DMSO_NHS_Rep1_S1 DMSO_NHS_Rep2_S2

I understand this is an issue with DESeq2 not matching the rows of the metafile to the columns of the counts file, but I cannot, for the life of me, figure out where the discrepancy is! I understand DESeq2 isn't your purview, but perhaps you might have some insight anyway?

Cheers,

Albert

sjroth commented 1 year ago

Hi Albert,

I would programmatically ensure that your experiment names match perfectly. Then, I would re-run the command with an overwrite specified. If this still throws the same error, I can inspect the files to see if there is a bug.
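One way to do that check is sketched below in Python. It is only an illustration, not part of ARTDeco: the helper names are hypothetical, and it assumes both files are tab-delimited, that the meta file has an `Experiment` column, and that the count table starts with `ID` and `Length` columns, as in the excerpts above. The sample strings stand in for the contents of meta.reformatted.txt and all_dogs.raw.txt.

```python
import csv
import io


def read_meta_experiments(meta_text):
    """Return the Experiment column of a tab-delimited meta file, in file order."""
    reader = csv.DictReader(io.StringIO(meta_text), delimiter="\t")
    return [row["Experiment"].strip() for row in reader]


def read_count_samples(counts_text, id_columns=("ID", "Length")):
    """Return the sample column names from the header line of a count table."""
    header = counts_text.splitlines()[0].split("\t")
    return [name.strip() for name in header if name.strip() not in id_columns]


def compare_names(meta_names, count_names):
    """Report names present on only one side and whether the order agrees."""
    meta_set, count_set = set(meta_names), set(count_names)
    return {
        "only_in_meta": sorted(meta_set - count_set),
        "only_in_counts": sorted(count_set - meta_set),
        "same_order": meta_names == count_names,
    }


# Stand-in data (assumed tab-delimited); replace with the real file contents.
meta_text = (
    "Experiment\tGroup\n"
    "20uM_10HS_Rep1_S9\t20uM_KL1_10HS\n"
    "DMSO_NHS_Rep1_S1\tDMSO_NHS\n"
)
counts_text = "ID\tLength\t20uM_10HS_Rep1_S9\tDMSO_NHS_Rep1_S1\n"

report = compare_names(
    read_meta_experiments(meta_text), read_count_samples(counts_text)
)
print(report)
```

Note that a check like this only compares the names exactly as they are written on disk. It will not catch a mismatch introduced later, for example if a downstream tool renames columns while loading the table.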

The other thing to ask yourself is if you actually care about differentially expressed DoGs. This was added as an extra feature but I have never used it in a readthrough context.

albertdyu commented 1 year ago

Thanks for the response! I figured it out, and it was something dumb indeed. I shouldn't have started the filenames with a number: R automatically stuck an X in front of every column name that begins with a digit so that it would be a valid R name, which meant the count columns no longer matched the names in the metafile. I added X's in front of the numbers in the metafile and now it works properly.
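For anyone hitting the same error, the renaming can be previewed before running the pipeline. The Python sketch below is a rough approximation of the rule R's `make.names()` applies (the same rule `read.table()` uses when `check.names = TRUE`): invalid characters become dots, and an `X` is prepended when a name does not start with a letter, or with a dot not followed by a digit. The function name `r_style_name` is hypothetical, the sketch ignores R reserved words and non-ASCII letters, and whether ARTDeco's R code reads the table this way is an assumption based on the behavior described above.

```python
import re


def r_style_name(name):
    """Approximate how R's make.names() would rewrite a column name."""
    # Characters other than letters, digits, dots and underscores become dots.
    cleaned = re.sub(r"[^A-Za-z0-9._]", ".", name)
    # A valid R name starts with a letter, or with a dot not followed by a digit;
    # anything else gets an "X" prepended.
    if not re.match(r"^([A-Za-z]|\.(?![0-9]))", cleaned):
        cleaned = "X" + cleaned
    return cleaned


samples = ["20uM_10HS_Rep1_S9", "60uM_NHS_Rep1_S5", "DMSO_NHS_Rep1_S1"]

# List only the names that R would change.
renamed = {s: r_style_name(s) for s in samples if r_style_name(s) != s}
for original, converted in renamed.items():
    print(f"{original} -> {converted}")
```

With the sample names from this thread, the two names starting with a digit are reported as changed and the DMSO name is left alone. Choosing sample names that begin with a letter avoids the problem altogether.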

As for the differentially expressed DoGs function, I have yet to compare it to the read-through analysis, but it seems, at the very least, notable to me! DoGs are expressed in some conditions and suppressed in others. If I understood your work correctly, I expect I will find the read-through and read-in analyses to be more biologically meaningful in the end.

Thank you for this fine piece of software!

sjroth commented 1 year ago

No worries! With R (and especially rpy), the syntax is going to be weird, dumb, and finicky. R is great for data manipulation but it is not the best language for pure scripting! I'm glad you could figure this out.

The reason I am skeptical about a differential DoG analysis is that many assumptions go into DoG comparisons. Chief among these is how to define a DoG across experiments: I take the longest DoG to be the consensus. Further, because most assays favor mature mRNAs and DoG transcripts aren't stably processed, the measurements are noisier. I've had far more success with readthrough levels as a comparison tool. In fact, that's the reason I developed the measure at my former advisor's insistence. Overall, I think it is a better, more interpretable measure that really does a great job distinguishing experiments.

albertdyu commented 1 year ago

Your insight is very helpful, and I expect I would very much like to pick your brain further!

Your concerns make sense and I agree with you - but I'm actually running ARTDeco on different types of nascent RNA sequencing experiments - NET-Seq, Pro-Seq, and Nascent-seq! I've been manually validating the DoGs defined by ARTDeco just by examining them in IGV, and in my opinion, the DoGs defined by ARTDeco are highly convincing!

sjroth commented 1 year ago

Happy to schedule a call if you would like. My email is sjroth@eng.ucsd.edu. However, be forewarned, I work in industry now so I'll be expecting publication credit should any work proceed.

sjroth / ARTDeco

Deseq2 colname matching error #13