smithlabcode / dnmtools

Tools for analyzing DNA methylation data
https://dnmtools.readthedocs.io
GNU General Public License v3.0

merge -t header seems to be not properly formatted #178

Closed moqri closed 1 year ago

moqri commented 1 year ago

Not a major issue but just thought to share in case:

It seems the tabular merge header is formatted in such a way that when the merged file is opened with other programs, they treat the first sample name as the header of the ID column and miss the last sample in the header. Here is an example with two samples ("y" and "o"):

[image: screenshot of the merged file header for samples "y" and "o"]

andrewdavidsmith commented 1 year ago

Thanks, @moqri. At one point, too much time was spent on this topic. We designed it to work with R data frames, for example:

X <- read.table("merged_counts.txt", header=TRUE)

This requires that the first column have no column header; otherwise (I think) it will be treated as a data column, and R will add automatic numbers as row labels instead of using the site names as row labels.
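The rule described above can be sketched in Python to make the two cases concrete. This is only an illustration that loosely mimics the behavior discussed for R, not R's actual implementation (it checks each row rather than only the first line); the function name `read_table_with_row_labels`, the site names, and the values are all made up for the example.

```python
import csv
import io


def read_table_with_row_labels(text, delimiter="\t"):
    """Parse a table whose header may have one fewer field than its data rows.

    Returns (columns, rows), where rows maps each row label to its values.
    """
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader)
    rows = {}
    for i, fields in enumerate(reader, start=1):
        if len(fields) == len(header) + 1:
            # Short header: the first field is the row label (e.g. a site name).
            label, values = fields[0], fields[1:]
        elif len(fields) == len(header):
            # Full header: every field is data, so the row is auto-numbered.
            label, values = str(i), fields
        else:
            raise ValueError(
                f"row {i} has {len(fields)} fields; "
                f"expected {len(header)} or {len(header) + 1}"
            )
        rows[label] = values
    return header, rows


# Hypothetical two-sample tables; site names and values are invented.
short_header = "y\to\nchr1:100\t0.5\t0.25\nchr1:200\t1.0\t0.75\n"
full_header = "site\ty\to\nchr1:100\t0.5\t0.25\nchr1:200\t1.0\t0.75\n"

short_columns, short_rows = read_table_with_row_labels(short_header)
full_columns, full_rows = read_table_with_row_labels(full_header)
```

With the short header, the site names become row labels and the columns are just the two samples. With the full header, the site name ends up as an ordinary data column and the rows are numbered.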

I think R would allow a tab or spaces before the first column name and would still load the file properly. Are you trying to load it into Excel? It might be a very easy fix on our end to make things more convenient. I know that if these files are many GB in size, fixing the header by opening the file in vim can also be a problem.
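For files too large to edit comfortably, one possible workaround is to rewrite the header while streaming the rest of the file unchanged. The sketch below uses only the Python standard library; the function name `add_index_label`, the default label `site`, and the demo file contents are assumptions for illustration and are not part of dnmtools.

```python
import os
import shutil
import tempfile


def add_index_label(src_path, dst_path, label="site", delimiter="\t"):
    """Copy src_path to dst_path, prefixing the header line with `label`."""
    with open(src_path, "r", newline="") as src, \
            open(dst_path, "w", newline="") as dst:
        header = src.readline()
        # Drop leading blanks so the label becomes the first header field.
        dst.write(label + delimiter + header.lstrip(" \t"))
        # Stream the remaining lines in chunks; the file is never fully loaded.
        shutil.copyfileobj(src, dst)


# Demo on a tiny made-up file (sample names "y" and "o" as in the thread).
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "merged_counts.txt")
    dst = os.path.join(tmp, "merged_counts_labeled.txt")
    with open(src, "w", newline="") as f:
        f.write("y\to\nchr1:100\t0.5\t0.25\n")
    add_index_label(src, dst)
    with open(dst, newline="") as f:
        labeled = f.read()
```

A file labeled this way has the same number of header fields as data fields, so in R it would presumably need `row.names=1` passed to `read.table` to keep the site names as row labels.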

moqri commented 1 year ago

Thank you, @andrewdavidsmith. I encountered this header issue when loading the file in Python (pandas). Adding a label for the index (first) column in your code might fix it for Python, R, and Excel.

andrewdavidsmith commented 1 year ago

That would break ordinary R by adding an extra column (the first one of which can be expensive), but it's very easy for us to add an option to do what you suggest, which we will do. It will probably be in the repo today, but it might take a couple of weeks to appear in a full new release.