Open ghost opened 10 years ago
Thanks for the detailed message. You are not alone. Many users have complained about this step. I also don't like this, and feel that merge_samples is incomplete until it has better handling of discrete variables.
There is currently no "off the shelf" option to make sure that merge_samples does not mangle discrete variables in a data.frame. This is the same behavior that you get from merge in base R on a data.frame. You found some already-posted solutions for fixing this, and I must agree that I would also like to have some kind of data-preserving solution.
One obvious issue, though, is: How do you "merge" the entries of discrete variables when/if they differ? Of course when the entries are the same nothing needs to be done at all, but very often there are differences.
It might suffice to paste/collapse the unique vector of strings of the original entries. That's my best first guess at a partial solution. This has the potential to expand the number of apparent categories, and so is not a general solution on its own, but might be a pragmatic first step that would work for many examples. There might be other issues that don't immediately come to mind as well. I suspect there is a reason that R's core merge function coerces in-data.frame factors to their underlying integer indices during the merge, and I think there is a similar coercion (or NA propagation?) for characters.
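The paste/collapse idea can be sketched in base R. This is only an illustration under stated assumptions, not something merge_samples currently does: the helper name collapse_unique, the "|" separator, and the toy table sam (a stand-in for sample_data) are all made up for this example.

```r
# Hypothetical helper: collapse the distinct values of one variable
# within a group into a single string.
collapse_unique <- function(x, sep = "|") {
  paste(sort(unique(as.character(x))), collapse = sep)
}

# Toy stand-in for sample_data; every name and value here is invented.
sam <- data.frame(
  LocationCode = c("A", "A", "B", "B"),
  Site         = factor(c("farm1", "farm1", "farm1", "farm2")),
  AgeRange     = c("young", "old", "old", "old"),
  stringsAsFactors = FALSE
)

# For comparison: coercing a factor to integer keeps only the level codes.
site_codes <- as.integer(sam$Site)   # 1 1 1 2

# Apply the helper to every non-grouping column, one group at a time.
merged_sam <- aggregate(
  sam[, c("Site", "AgeRange")],
  by  = list(LocationCode = sam$LocationCode),
  FUN = collapse_unique
)
print(merged_sam)
```

Groups whose entries agree keep their single value, while groups that disagree get a new combined label such as "old|young". That is the category expansion mentioned above, so the result may still need manual recoding before a differential analysis.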
I will keep this open until I have at least attempted, or included, the solution I mentioned above.
Thanks for your feedback and interest in phyloseq!
I believe that my issue is related to threads elsewhere, but I am unable to resolve my problem independently, so I am seeking additional guidance. Please let me know if I have provided enough information about my issue.
My goal is to merge samples of the same level in one factor while retaining the integrity of levels on other factors so that I can do differential analyses on those factors.
In my attempts to do this using merge_samples(), I think I have created some messes. In short, I appear to be having a problem with levels of factors in my “sample_data”, similar to (https://github.com/joey711/phyloseq/issues/243).
To begin: (1) Null Values in sample_data?
When I view the first few lines of the import, things look OK:
However, when I specifically look for the levels, some columns do not appear to have levels assigned ($Date, $Site, $Age, $AgeRange):
However, when I use factor to look at the values in $Age, levels are listed.
Why would they not also be in output for levels(sample_data(Piglet_Data)$Age)?
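One possible explanation, assuming those columns were imported as character or numeric vectors rather than as factors (this cannot be confirmed from the post), is that levels() only reads an existing levels attribute, whereas factor() builds a new factor and therefore always prints levels. A minimal base R sketch with an invented age_chr vector:

```r
# Toy column, assumed (not confirmed) to mimic how $Age was imported.
age_chr <- c("3wk", "6wk", "3wk", "9wk")

levels(age_chr)          # NULL: a character vector has no levels attribute
levels(factor(age_chr))  # factor() creates the levels on the fly

# Converting the column once makes levels() and factor() agree.
age_fct <- factor(age_chr)
is.factor(age_fct)
levels(age_fct)
```

Checking class() on the column in question would show whether this is the case. If it is, converting the column to a factor before merging is one thing to try.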
(2) I assume that the NULL values in the $Site factor are why this command is failing when I try to merge by $Site (since there are no levels being detected?):
(3) An issue similar to (https://github.com/joey711/phyloseq/issues/243), but I lack the experience to adapt the solution posted there to my own work at the moment. I followed http://joey711.github.io/phyloseq-demo/Restroom-Biogeography to merge these data by a factor whose levels do appear to be detected in sample_data ($LocationCode):
merged = merge_samples(Piglet_Data_RF, "LocationCode")
But when I looked at the first few lines of the sample_data, I noticed several changes to the variables:
The unmerged data look like:
I was able to successfully assign the levels of the factor from the original data to the new data, but when I tried to do the same for the other factors that got changed, I created a new problem:
It got rid of the factor that I tried to relabel (21 variables now instead of 22).
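A disappearing column is what base R does when NULL is assigned to a data.frame column, and levels() returns NULL for any column that is not a factor. Whether that is what happened here is only a guess, since the exact command is not shown. The toy data.frame below is invented for illustration.

```r
# Invented example data; not taken from the original post.
df <- data.frame(
  id   = 1:3,
  Age  = c(1, 2, 1),
  Site = c("a", "b", "a"),
  stringsAsFactors = FALSE
)
n_before <- ncol(df)

# levels() of a non-factor column is NULL ...
new_labels <- levels(df$Site)

# ... and assigning NULL to a column deletes that column.
df$Age <- new_labels
n_after <- ncol(df)

# A safer pattern: convert to a factor first, then inspect its levels.
df$Site <- factor(df$Site)
site_levels <- levels(df$Site)
```

Checking is.null() on the right-hand side before assigning would catch this case without silently dropping the variable.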
(4) The final issue that I have is that while I am able to relabel a couple of factors, there are other factors that cannot be relabeled:
Any insight or suggestions would be appreciated.