Question about sva correction: why does IsoformSwitchAnalyzeR tell me I have a lot of SVs but sva actually disagrees?

bozbezbozzel commented 2 months ago

Hi,

I'm following the steps in your vignette (great work, by the way) for my data from about 100 NSCLC patients. However, importRdata keeps warning me that sva finds too many sources of variation.

My data is fairly heterogeneous, as you'd expect from patient samples from a disease such as lung cancer, but I have a few things going for me, namely:

- these are all resected tumors that received little or no pre-treatment
- I split them out by histological subtype, since that is by far the largest source of variation in the gene-level counts
- I added sex and tumor stage as covariates, since I know from my other analyses that these do add a little bit of variation
- the quality of my samples is good and my regular RNA expression analyses are going well

Now, out of interest, I ran sva separately on my transcript-level and gene-level counts, imported with tximport and scaled in the case of transcripts. Using num.sv with be as the method (the default, as far as I'm aware) gives me 3 SVs for the counts and 1 for the transcripts. That doesn't seem like it would warrant the warnings I keep getting.

I resorted to disabling sva with detectUnwantedEffects = FALSE, but now I'm curious where this discrepancy could come from. I also noticed that the guesstimated DTU number is zero, which seems possible but not likely. I'm wondering if I'm doing something wrong and just not noticing?

chunxubioinfor commented 1 month ago

Hi! I just checked the source code, and yes, ISAR first log-transforms and filters the expression data, then applies num.sv to estimate the number of SVs. So I guess the difference between the results from ISAR and your own analysis might derive from the data preprocessing before sva. Also, an estimated DTU of zero is very weird. Could you share the conditions or comparisons in your analysis?
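
To make the preprocessing point concrete, here is a minimal sketch in R of how log-transforming and filtering a matrix changes what num.sv sees. Everything in it is an assumption for illustration: the data are simulated, preprocess_expression() is a hypothetical helper, and the pseudocount and min_mean cutoff are placeholders rather than the values ISAR actually uses.

```r
# Hypothetical helper: drop low-expression features, then log2-transform.
# The pseudocount and cutoff are illustrative assumptions, not ISAR's values.
preprocess_expression <- function(expr, min_mean = 1, pseudocount = 1) {
  keep <- rowMeans(expr) >= min_mean            # filter on the raw-scale mean
  log2(expr[keep, , drop = FALSE] + pseudocount)
}

set.seed(1)
n_samples <- 20

# Simulated abundances: 500 features x 20 samples (toy data only)
expr <- matrix(
  rexp(500 * n_samples, rate = 0.5),
  nrow = 500,
  dimnames = list(paste0("tx", 1:500), paste0("s", 1:n_samples))
)
expr[1:100, ] <- expr[1:100, ] * 0.01           # make 100 features near-zero

design <- data.frame(
  condition = factor(rep(c("infiltrated", "excluded"), each = n_samples / 2))
)
mod <- model.matrix(~ condition, data = design)

expr_pp <- preprocess_expression(expr)

# Compare the SV estimate on raw versus preprocessed input (needs sva installed)
if (requireNamespace("sva", quietly = TRUE)) {
  n_raw <- sva::num.sv(expr, mod, method = "be", seed = 1)
  n_pp  <- sva::num.sv(expr_pp, mod, method = "be", seed = 1)
  message("SVs on raw input: ", n_raw, "; on log-filtered input: ", n_pp)
}
```

On real data, swapping the simulated matrix for the tximport output and running both calls with the same model matrix should show whether the preprocessing alone accounts for the jump in the SV estimate.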

bozbezbozzel commented 1 month ago

Hi Chunxu, thanks for your reply. I'll try to replicate the data preprocessing to see if it makes a difference. My comparison is simple: samples that are infiltrated with CD8 cells versus samples that are CD8-excluded, about a 50/50 split in this dataset. Biologically, a strong DTU difference isn't necessarily expected; I just thought it would be interesting to check. But having nothing significant at all seemed a bit odd to me.

bozbezbozzel commented 1 month ago

Just to add that I reran num.sv on logged and filtered abundances and counts, and the number of SVs did increase (22 for the counts, 25 for the abundances, which doesn't seem crazy to me for patient data). Running the downstream code manually (checking that the SVs are not too highly correlated/diagonal) allows me to add SVs, so I still don't understand where the errors are coming from. Happy to make a more formal bug report with everything I did.
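
For anyone wanting to reproduce the kind of manual check described here, the following is a rough sketch in base R of flagging SVs that are too highly correlated with the design before adding them. It is not ISAR's actual code: flag_correlated_svs() is a hypothetical helper, the 0.8 cutoff is an arbitrary assumption, and the SVs are simulated rather than estimated by sva.

```r
# Hypothetical helper: flag SVs whose absolute Pearson correlation with any
# design column exceeds a cutoff. The 0.8 cutoff is an illustrative assumption.
flag_correlated_svs <- function(sv, mod, cutoff = 0.8) {
  mod <- mod[, colnames(mod) != "(Intercept)", drop = FALSE]
  cors <- abs(cor(sv, mod))          # rows are SVs, columns are design terms
  apply(cors, 1, max) > cutoff
}

set.seed(42)
n <- 40
design <- data.frame(
  condition = factor(rep(c("infiltrated", "excluded"), each = n / 2)),
  sex       = factor(sample(c("F", "M"), n, replace = TRUE))
)
mod <- model.matrix(~ condition + sex, data = design)

# Toy SVs: sv1 is independent noise, sv2 nearly duplicates the condition column
sv <- cbind(
  sv1 = rnorm(n),
  sv2 = mod[, 2] + rnorm(n, sd = 0.05)
)

flagged <- flag_correlated_svs(sv, mod)
print(flagged)

# Only the SVs that are not flagged are appended to the design
design_with_sv <- cbind(design, sv[, !flagged, drop = FALSE])
```

Comparing the flags from a check like this against what importRdata reports on the same SVs may help narrow down which step raises the message.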

kvittingseerup / IsoformSwitchAnalyzeR #238