lindsayrutter / bigPint

Bioconductor package that makes BIG data pint-sized.
https://lindsayrutter.github.io/bigPint/

Formatting dataset for bigPint #7

Closed: kem823 closed this issue 4 years ago

kem823 commented 4 years ago

Hi,

I found your paper in BMC Bioinformatics recently while looking for a better way to visually display my RNA-Seq data. It was really informative and helpful, since I am currently a Master's student/lab tech, and this is my first experience going in-depth with my own RNA-Seq project.

I was testing out some static scatterplots with my data, which worked perfectly with your example dataset. Here is my code for the scatterplot, following your tutorial:

gwi_plot = plotSM(gwi_data, gwi_metrics, option = "foldChange", threshFC = 1.2, pointSize = 0.2, saveFile = FALSE)

This seems like a simple question, but I keep getting this error: Error in helperTestData(data) : First column of data object must be of class 'character'. However, I have triple-checked, and I have my data frame formatted as you described, plus the ID column is in fact a character vector.

Do you have any thoughts or guidance? I'm happy to provide any more information that might be helpful.

Best, Katie

kem823 commented 4 years ago

Quick follow-up, since I've continued to play around with it in the meantime.

I was able to reimport the data with the Entrez IDs as the row names (the way that your example data set looks), which seems to have solved the character issue for some reason. However, I am now getting the error: Error in helperTestDataMetrics(data, dataMetrics, threshVar) : At least one column in each list element in the data metrics object should have the same name as the threshVar object.

My data metrics columns are as follows: ID, RPKM, logFC, FC, PValue. I can understand it not recognizing RPKM or FC, as these were not included in your example, and I wasn't sure that they would work anyway. However, logFC and PValue seem like they should be enough.

Katie

lindsayrutter commented 4 years ago

Hello Katie:

Thank you for your questions here. Yes, I would be happy to help figure out the root cause of the issue.

1) For your first comment (with the error "First column of data object must be of class 'character'."), can you reply with the structure of both your gwi_data and gwi_metrics objects? (i.e. the output of running str(gwi_data) and str(gwi_metrics)).

2) For your second comment (with the error "At least one column in each list element in the data metrics object should have the same name as the threshVar object."), I am assuming you are still running the command:

gwi_plot = plotSM(gwi_data, gwi_metrics, option = "foldChange", threshFC = 1.2, pointSize = 0.2, saveFile = FALSE)

If this is the case, it is likely because the plotSM() command uses "FDR" as the default name for the threshVar object. We can see this by running the help command (??plotSM). It states that:

threshVar
CHARACTER STRING | 
Name of column in dataMetrics object that is used to threshold significance; 
default "FDR"; 
used in all options

So, to fix this error, you may want to simply specify the input parameter threshVar so that it is equal to a significance-like column name that is in your gwi_metrics object (such as "PValue"). So, your new command could look like:

gwi_plot = plotSM(gwi_data, gwi_metrics, threshVar = "PValue", option = "foldChange", threshFC = 1.2, pointSize = 0.2, saveFile = FALSE)

Note that the threshVar variable has a corresponding threshVal variable, which has a default value of 0.05. So, you could, say, lower the P-value threshold (from 0.05 to 0.01, for instance) with the following command:

gwi_plot = plotSM(gwi_data, gwi_metrics, threshVar = "PValue", threshVal = 0.01, option = "foldChange", threshFC = 1.2, pointSize = 0.2, saveFile = FALSE)

Another option, if you are only interested in fold change (and not significance), is to simply omit the gwi_metrics object altogether, as follows:

gwi_plot = plotSM(gwi_data, option = "foldChange", threshFC = 1.2, pointSize = 0.2, saveFile = FALSE)

You can see examples for these two options for fold change scatterplot matrices (including significance values or not) in the last two plots of this section.
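
Before calling plotSM(), the column-name requirement described above can also be checked directly in base R. The following is only a minimal sketch and is not part of bigPint: the helper name findThreshVar and the toy metrics list (including all of its values) are hypothetical, made up for illustration. For each list element it reports whether a column is named exactly like threshVar, and which columns would match only if case were ignored.

```r
# Toy data metrics object: a named list of data frames.
# All IDs and values here are made up for illustration.
toy_metrics <- list(
  A_B = data.frame(
    ID = c("gene1", "gene2", "gene3"),
    logFC = c(0.32, -1.10, 0.05),
    Pvalue = c(0.12, 0.01, 0.80),
    stringsAsFactors = FALSE
  )
)

# Hypothetical helper (NOT a bigPint function): for each list element,
# report whether a column is named exactly threshVar, and which columns
# match only after ignoring case.
findThreshVar <- function(metrics, threshVar) {
  lapply(metrics, function(df) {
    cols <- colnames(df)
    list(
      exact = threshVar %in% cols,
      nearMisses = cols[tolower(cols) == tolower(threshVar) & cols != threshVar]
    )
  })
}

res <- findThreshVar(toy_metrics, "PValue")
res$A_B$exact       # FALSE: no column is named exactly "PValue"
res$A_B$nearMisses  # "Pvalue": differs from "PValue" only in case
```

Running colnames() on each element of the real data metrics object gives the same information by eye; the helper just makes near misses easier to spot.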

kem823 commented 4 years ago

Hi,

Here are the structures of both my data and data metrics objects:

> str(gwi_data)

'data.frame': 25676 obs. of 13 variables:
 $ ID   : chr "4933401J01Rik" "Xkr4" "Gm37180" "Gm37363" ...
 $ A.92 : num NA 1.97 NA NA NA ...
 $ A.93 : num NA 3.99 NA NA NA ...
 $ A.94 : num NA 2.41 NA NA NA ...
 $ A.95 : num NA 2.78 NA NA NA ...
 $ A.96 : num NA 5.15 NA NA NA ...
 $ A.97 : num NA 2.44 NA NA NA ...
 $ B.101: num NA 2.99 NA NA NA ...
 $ B.102: num NA 3.63 NA NA NA ...
 $ B.103: num NA 4.42 NA NA NA ...
 $ B.104: num NA 4.3 NA NA NA ...
 $ B.105: num NA 3.96 NA NA NA ...
 $ B.106: num NA 3.05 NA NA NA ...

> str(gwi_metrics)

List of 1
 $ A_B:'data.frame': 25676 obs. of 5 variables:
  ..$ ID    : chr [1:25676] "4933401J01Rik" "Xkr4" "Gm37180" "Gm37363" ...
  ..$ RPKM  : num [1:25676] 0 3.72 0 0 0 ...
  ..$ logFC : num [1:25676] NA 0.322 NA NA NA ...
  ..$ FC    : num [1:25676] NA 1.25 NA NA NA ...
  ..$ Pvalue: num [1:25676] NA 0.12 NA NA NA ...

As I mentioned, I seem to have solved my issue with gwi_data by adding a column of row names.

I also tried your suggestion about specifying threshVar and threshVal, but I am still getting the same error: Error in helperTestDataMetrics(data, dataMetrics, threshVar) : At least one column in each list element in the data metrics object should have the same name as the threshVar object.

Thanks for your help, Katie

lindsayrutter commented 4 years ago

Hello Katie:

I think there may be two issues here.

1) The column names are case-sensitive. I think you initially wrote that the column name in your gwi_metrics object was "PValue". However, in your most recent post, it seems the column name is "Pvalue" (lower case 'v'). So, hopefully, if you rerun the commands I recommended, but specifying threshVar as "Pvalue", that error should no longer pop up.

2) Upon fixing just that, though, you may get a new error: "Error in seq.default(minLine, maxLine, inc) : 'to' must be a finite number". If you see this error, it is because of the NA values in your data and data metrics objects. I would recommend removing the rows that contain NA values, using code like the following:

gwi_data = gwi_data[complete.cases(gwi_data), ]
gwi_metrics[[1]] = gwi_metrics[[1]][complete.cases(gwi_metrics[[1]]), ]

Hopefully, this helps solve the problem or at least get closer to the solution. Please let me know how it goes. Thank you!
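
To make the effect of the complete.cases() filtering above concrete, here is a small self-contained sketch on made-up data. The toy_data and toy_metrics objects and all of their values are hypothetical. The final step, which keeps only the IDs present in both objects, is an extra precaution added here as an assumption; nothing in this thread states that bigPint requires it.

```r
# Toy stand-ins for the data and data metrics objects (values are made up).
toy_data <- data.frame(
  ID = c("g1", "g2", "g3", "g4"),
  A.1 = c(NA, 1.97, 2.50, 3.10),
  B.1 = c(NA, 2.99, 3.40, 2.20),
  stringsAsFactors = FALSE
)
toy_metrics <- list(
  A_B = data.frame(
    ID = c("g1", "g2", "g3", "g4"),
    logFC = c(NA, 0.32, -0.40, NA),
    Pvalue = c(NA, 0.12, 0.03, NA),
    stringsAsFactors = FALSE
  )
)

# Drop every row that contains at least one NA, as in the commands above.
toy_data <- toy_data[complete.cases(toy_data), ]
toy_metrics[[1]] <- toy_metrics[[1]][complete.cases(toy_metrics[[1]]), ]

# Optional extra step (an assumption, not a documented bigPint requirement):
# keep only the IDs that survive in both objects.
shared <- intersect(toy_data$ID, toy_metrics[[1]]$ID)
toy_data <- toy_data[toy_data$ID %in% shared, ]
toy_metrics[[1]] <- toy_metrics[[1]][toy_metrics[[1]]$ID %in% shared, ]

nrow(toy_data)       # 2
toy_metrics[[1]]$ID  # "g2" "g3"
```

Here g1 is dropped from both objects because every measurement is NA, and g4 is dropped by the optional step because its metrics are NA even though its data values are complete.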

kem823 commented 4 years ago

Hi,

I got it to work! Thanks for your advice.

For some reason, my original import file for gwi_metrics (called gwi_metrics_in but later renamed A_B as an item in the list gwi_metrics) has the correct column name ("PValue" vs. "Pvalue"), but you were correct that it somehow was changed when I made it into a list of data frames. I didn't notice it because the original data frame retained the correct column name after import. Not sure how that happened, but regardless, I know to check on it in the future.

I did get the new error that you mentioned but was able to fix that as well following your recommendation.

bigPint is a great tool that's easy to use (even for a student), so I'm excited to play around with it a little more. Thanks again for your responsiveness and clear explanations to resolve my issues.

Best, Katie

lindsayrutter commented 4 years ago

Hello Katie:

Glad that resolved the issue! Great to know that bigPint has been easy to use so far. If you run into any other issues or see areas for improvement, feel free to let me know! Thank you.