joey711 / phyloseq

phyloseq is a set of classes, wrappers, and tools (in R) to make it easier to import, store, and analyze phylogenetic sequencing data; and to reproducibly share that data and analysis with others. See the phyloseq front page:
http://joey711.github.io/phyloseq/

Very High Log2FoldChanges #1737

Open TamarSmulders opened 8 months ago

TamarSmulders commented 8 months ago

I am using nasal microbiome data for the DESeq2 analysis in phyloseq. This is not 16S data, but it is very similar. The technique is called IS-pro: instead of sequencing the 16S region, it measures the length of the 16S-23S interspace region to distinguish between species. I am using the exact code from the example with the Kostic data set, only with my own data. I am getting log2 fold changes between 24 and 30, which seems too high.

Is my baseline data not suitable for this analysis, or does it need to be preprocessed? I did notice that my abundance values are a lot higher than in the Kostic otu_table.

This is my OTU table: otu_table_data6_github.csv

Results:

      baseMean log2FoldChange    lfcSE     stat       pvalue         padj         Phylum
295  708.02157       30.00000 3.359762 8.929204 4.290856e-19 1.072714e-17     Firmicutes
855 1008.65757       28.81921 3.390482 8.500034 1.895359e-17 2.369199e-16 Proteobacteria
853   66.42386       25.01773 3.968248 6.304478 2.891663e-10 2.409720e-09 Proteobacteria
306   55.58879       24.77759 3.968280 6.243911 4.267619e-10 2.667262e-09     Firmicutes
440   62.36167       24.00495 3.968259 6.049239 1.455314e-09 6.063809e-09 Actinobacteria
852   33.13370       24.05514 3.968411 6.061655 1.347276e-09 6.063809e-09 Proteobacteria

                  Class           Order             Family           Genus
295             Bacilli Lactobacillales   Streptococcaceae   Streptococcus
855 Gammaproteobacteria  Pasteurellales    Pasteurellaceae     Haemophilus
853 Gammaproteobacteria  Pasteurellales    Pasteurellaceae     Haemophilus
306             Bacilli Lactobacillales  Carnobacteriaceae  Dolosigranulum
440      Actinobacteria Actinomycetales Corynebacteriaceae Corynebacterium
852 Gammaproteobacteria  Pasteurellales    Pasteurellaceae     Haemophilus

                           Species
295 Streptococcus pneumoniae/mitis
855         Haemophilus influenzae
853         Haemophilus influenzae
306          Dolosigranulum pigrum
440    Corynebacterium propiunquum
852         Haemophilus influenzae
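One quick check when comparing against the Kostic example is to look at the per-sample totals and at how many samples in each group have a nonzero count for each taxon. Very large fold changes can arise when a taxon has only zeros in one group, although nothing in this thread establishes that this is the cause here. The sketch below uses base R on a toy matrix; `counts`, `group`, and all values are hypothetical stand-ins, not the IS-pro data.

```r
# Toy count matrix: taxa are rows, samples are columns (hypothetical values).
counts <- matrix(
  c(  0,    0,   0, 501, 658,  807,
    918, 1200, 950, 870, 990, 1010),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("taxonA", "taxonB"), paste0("S", 1:6))
)

# Hypothetical two-group design, one label per sample (column).
group <- factor(c("control", "control", "control", "case", "case", "case"))

# Per-sample totals, to compare the overall scale between data sets.
sample_totals <- colSums(counts)

# Number of samples with a nonzero count, per taxon (rows) and group (columns).
nonzero_by_group <- sapply(levels(group), function(g) {
  rowSums(counts[, group == g, drop = FALSE] > 0)
})

# Taxa that have no nonzero count in at least one group.
absent_in_a_group <- rownames(nonzero_by_group)[
  apply(nonzero_by_group == 0, 1, any)
]

print(sample_totals)
print(nonzero_by_group)
print(absent_in_a_group)
```

With a phyloseq object, the count matrix could be taken from `otu_table()` and the grouping from `sample_data()`; the variable holding the groups depends on the study design and is not given in this thread.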

spholmes commented 8 months ago

Hi,

If you think you have a strong effect but don't know how to preprocess, I would suggest the rank threshold transformation: within each sample, you rank the species from most abundant down to near zero, as done in the workflow paper. The results are then very robust to any type of transformation: https://f1000research.com/articles/5-1492/v2

We do this transformation for the PCA on ranks, but doing it for testing is just as valid. Good luck,
Susan
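A minimal sketch of a rank threshold transformation, as I understand the idea from the workflow paper: rank the taxa within each sample, then shift the ranks down and set everything below 1 to 1, so that all low-abundance taxa are tied. The toy matrix and the `threshold` value are assumptions chosen for illustration, and the matrix has samples as rows, so the row-wise `apply()` ranks within each sample.

```r
# Toy abundance matrix with samples as rows and taxa as columns
# (hypothetical values, not the data from this thread).
abund <- matrix(
  c(0, 0, 12, 501, 7175,
    3, 0,  0, 918,   40),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("sample1", "sample2"), paste0("taxon", 1:5))
)

# Rank taxa within each sample (MARGIN = 1 works on rows).
# apply() returns one column per row processed, so t() restores
# the samples-as-rows layout. The smallest abundance gets rank 1.
abund_ranks <- t(apply(abund, 1, rank))

# Threshold step: shift the ranks down and tie everything below 1 at 1.
# The cutoff is an assumed, illustrative value.
threshold <- 2
abund_ranks <- abund_ranks - threshold
abund_ranks[abund_ranks < 1] <- 1

print(abund_ranks)
```

The cutoff would normally be chosen relative to the number of taxa, so that only the more abundant taxa in each sample keep distinct ranks.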

On Thu, Mar 21, 2024 at 1:39 PM TamarSmulders @.***> wrote:


This is my otu table otu_table_data6_github.csv https://github.com/joey711/phyloseq/files/14695905/otu_table_data6_github.csv


-- Susan Holmes

TamarSmulders commented 8 months ago

Hi Susan,

Thanks for your swift reply! I read the article, very useful. I am trying to understand which transformation you mean. Is it this one?

PCA on ranks

Microbial abundance data is often heavy-tailed, and sometimes it can be hard to identify a transformation that brings the data to normality. In these cases, it can be safer to ignore the raw abundances altogether, and work instead with ranks. We demonstrate this idea using a rank-transformed version of the data to perform PCA. First, we create a new matrix, representing the abundances by their ranks, where the microbe with the smallest in a sample gets mapped to rank 1, second smallest rank 2, etc.

I have performed it

abund <- otu_table(PhyloData6)
abund_ranks <- t(apply(abund, 1, rank))

Output:

head(otu_table(PhyloData6))
OTU Table:          [6 taxa and 100 samples]
                     taxa are rows
    E1P0004V501T4 E1P0008V530T4 E1P0019V717T4 E1P0020V718T4 E1P0024V574T4 E1P0047V514T4 E1P0048V515T4 E1P0051V582T4II E1P0056V518T4
237             0             0             0             0             0             0             0               0             0
250             0             0             0             0             0             0             0               0             0
281             0             0             0             0             0           501             0               0             0
283             0             0             0             0             0             0             0               0             0
287             0             0             0             0             0             0             0               0             0
288             0             0             0             0             0             0             0               0             0
    E1P0107V727T4 E1P0109V560T4 E1P0121V692T4 E1P0123V741T4 E1P0132V746T4 E1P0155V596T4 E1P0159V725T4 E1P0173V659T4 E1P0183V654T4
237             0             0             0             0             0             0             0             0             0
250             0             0             0             0             0             0             0             0             0
281             0             0             0             0             0             0             0             0             0
283             0             0             0             0             0             0             0             0             0
287             0             0             0             0             0             0             0             0             0
288             0             0             0             0             0             0             0             0             0
    E1P0184V548T4 E1P0187V595T4 E1P0195V694T4 E1P0205V739T4 E1P0214V571T4PE E1P0222V533T4 E1P0226V667T4 E1P0233V710T4 E1P0234V626T4
237             0             0             0             0               0             0             0             0             0
250             0             0             0             0               0             0             0             0             0
281             0             0             0             0               0             0             0             0             0
283             0             0             0             0               0             0             0             0             0
287             0             0             0             0               0             0             0             0             0
288             0             0             0             0               0             0             0             0             0
    E1P0261V669T4 E1P0269V722T4 E1P0274V679T4 E1P0291V588T4 E1P0294V629T4 E1P0300V591T4PE E1P0308V716T4 E1P0318V649T4 E1P0325V615T4
237             0             0             0             0             0               0             0             0             0
250             0             0             0             0             0               0             0             0             0
281             0             0             0             0             0               0             0             0             0
283             0             0             0             0             0               0             0             0             0
287             0             0             0             0             0               0             0             0             0
288             0             0          7175             0             0               0             0             0             0
    E1P0333V564T4 E1P0337V592T4 E1P0347V743T4 E1P0377V632T4 E1P0386V599T4 E1P0392V634T4 E1P0409V608T4 E1P0410V607T4 E1P0414V660T4
237             0             0             0             0             0             0             0             0             0
250             0             0             0             0             0             0             0             0             0
281             0             0             0             0             0             0             0             0             0
283             0             0             0             0             0             0             0           859             0
287             0             0             0             0             0             0             0             0             0
288             0             0             0             0             0             0             0             0             0
    E1P0415V633T4 E1P0431V616T4 E1P0462V589T4 E1P0464V620T4 E1P0465V508T4 E1P0476V675T4 E1P0486V700T4 E1P0489V742T4 E1P0493V552T4
237             0             0             0             0             0             0             0             0             0
250             0             0             0             0             0             0             0             0             0
281             0             0             0             0             0             0             0             0             0
283             0             0             0             0             0             0             0             0             0
287             0             0             0             0             0             0             0             0             0
288             0             0             0             0             0             0             0             0             0
    E1P0495V639T4 E1P0496V605T4 E1P0500V593T4 E1P0522V540T4 E1P0527V606T4 E1P0531V602T4II E1P0535V701T4II E1P0538V672T4 E1P0569V524T4
237             0           918             0             0             0               0               0             0             0
250             0             0             0             0             0               0               0             0             0
281             0             0             0             0             0               0               0           658             0
283             0             0             0             0             0               0               0             0             0
287             0             0             0             0             0               0               0             0             0
288             0             0             0             0             0               0               0             0             0
    E1P0575V601T4 E1P0577V644T4 E1P0590V604T4 E1P0593V650T4 E1P0627V681T4 E1P0634V631T4 E1P0651V613T4 E1P0682V610T4 E1P0683V624T4
237             0             0             0             0             0             0             0             0             0
250             0             0             0             0             0           504             0             0             0
281             0             0             0           807             0             0             0             0             0
283             0             0             0             0             0             0             0             0             0
287             0             0             0             0             0             0             0             0             0
288             0             0             0             0             0             0             0             0             0
    E1P0720V640T4 E1P0751V712T4 E1P0754V720T4 E1P0755V647T4 E1P0789V686T4 E1P0832V655T4 E1P0833V603T4 E1P0851V597T4 E1P0861V709T4II
237             0             0             0             0             0             0             0             0               0
250          3023             0             0             0             0             0             0             0               0
281             0             0             0             0             0             0             0             0               0
283             0             0             0             0             0             0             0             0               0
287             0             0             0             0             0             0             0             0               0
288             0             0          2831             0             0             0             0             0               0
    E1P0863V688T4 E1P0865V645T4II E1P0946V674T4 E1P0977V581T4 E1P1025V733T4 E1P1040V734T4 E1P1058V658T4 E1P1060V736T4 E1P1084V663T4
237             0               0             0             0             0             0             0             0             0
250             0               0             0             0             0             0             0             0             0
281             0               0           577             0             0             0             0             0             0
283             0               0             0             0             0             0             0             0             0
287             0               0             0             0             0             0             0             0          7131
288             0               0          5923             0             0             0             0             0             0
    E1P1094V638T4 E1P1103V702T4 E1P1107V590T4 E1P1120V627T4 E1P1133V726T4 E1P1146V580T4 E1P1193V661T4 E1P1199V583T4 E1P1205V569T4PE
237             0             0             0             0             0             0             0             0               0
250             0             0             0             0             0             0             0             0               0
281             0             0             0             0             0             0             0             0               0
283             0             0             0             0             0             0             0             0               0
287             0             0             0             0             0             0             0             0               0
288             0             0             0             0             0             0             0             0               0
    E1P1240V696T4
237             0
250             0
281             0
283             0
287             0
288             0

head(otu_table(PhyloData6Ranks))
OTU Table:          [6 taxa and 100 samples]
                     taxa are rows
    E1P0004V501T4 E1P0008V530T4 E1P0019V717T4 E1P0020V718T4 E1P0024V574T4 E1P0047V514T4 E1P0048V515T4 E1P0051V582T4II E1P0056V518T4
237          50.0          50.0          50.0          50.0          50.0          50.0          50.0            50.0          50.0
250          49.5          49.5          49.5          49.5          49.5          49.5          49.5            49.5          49.5
281          48.5          48.5          48.5          48.5          48.5          97.0          48.5            48.5          48.5
283          50.0          50.0          50.0          50.0          50.0          50.0          50.0            50.0          50.0
287          50.0          50.0          50.0          50.0          50.0          50.0          50.0            50.0          50.0
288          49.0          49.0          49.0          49.0          49.0          49.0          49.0            49.0          49.0
    E1P0107V727T4 E1P0109V560T4 E1P0121V692T4 E1P0123V741T4 E1P0132V746T4 E1P0155V596T4 E1P0159V725T4 E1P0173V659T4 E1P0183V654T4
237          50.0          50.0          50.0          50.0          50.0          50.0          50.0          50.0          50.0
250          49.5          49.5          49.5          49.5          49.5          49.5          49.5          49.5          49.5
281          48.5          48.5          48.5          48.5          48.5          48.5          48.5          48.5          48.5
283          50.0          50.0          50.0          50.0          50.0          50.0          50.0          50.0          50.0
287          50.0          50.0          50.0          50.0          50.0          50.0          50.0          50.0          50.0
288          49.0          49.0          49.0          49.0          49.0          49.0          49.0          49.0          49.0
    E1P0184V548T4 E1P0187V595T4 E1P0195V694T4 E1P0205V739T4 E1P0214V571T4PE E1P0222V533T4 E1P0226V667T4 E1P0233V710T4 E1P0234V626T4
237          50.0          50.0          50.0          50.0            50.0          50.0          50.0          50.0          50.0
250          49.5          49.5          49.5          49.5            49.5          49.5          49.5          49.5          49.5
281          48.5          48.5          48.5          48.5            48.5          48.5          48.5          48.5          48.5
283          50.0          50.0          50.0          50.0            50.0          50.0          50.0          50.0          50.0
287          50.0          50.0          50.0          50.0            50.0          50.0          50.0          50.0          50.0
288          49.0          49.0          49.0          49.0            49.0          49.0          49.0          49.0          49.0
    E1P0261V669T4 E1P0269V722T4 E1P0274V679T4 E1P0291V588T4 E1P0294V629T4 E1P0300V591T4PE E1P0308V716T4 E1P0318V649T4 E1P0325V615T4
237          50.0          50.0          50.0          50.0          50.0            50.0          50.0          50.0          50.0
250          49.5          49.5          49.5          49.5          49.5            49.5          49.5          49.5          49.5
281          48.5          48.5          48.5          48.5          48.5            48.5          48.5          48.5          48.5
283          50.0          50.0          50.0          50.0          50.0            50.0          50.0          50.0          50.0
287          50.0          50.0          50.0          50.0          50.0            50.0          50.0          50.0          50.0
288          49.0          49.0         100.0          49.0          49.0            49.0          49.0          49.0          49.0
    E1P0333V564T4 E1P0337V592T4 E1P0347V743T4 E1P0377V632T4 E1P0386V599T4 E1P0392V634T4 E1P0409V608T4 E1P0410V607T4 E1P0414V660T4
237          50.0          50.0          50.0          50.0          50.0          50.0          50.0          50.0          50.0
250          49.5          49.5          49.5          49.5          49.5          49.5          49.5          49.5          49.5
281          48.5          48.5          48.5          48.5          48.5          48.5          48.5          48.5          48.5
283          50.0          50.0          50.0          50.0          50.0          50.0          50.0         100.0          50.0
287          50.0          50.0          50.0          50.0          50.0          50.0          50.0          50.0          50.0
288          49.0          49.0          49.0          49.0          49.0          49.0          49.0          49.0          49.0
    E1P0415V633T4 E1P0431V616T4 E1P0462V589T4 E1P0464V620T4 E1P0465V508T4 E1P0476V675T4 E1P0486V700T4 E1P0489V742T4 E1P0493V552T4
237          50.0          50.0          50.0          50.0          50.0          50.0          50.0          50.0          50.0
250          49.5          49.5          49.5          49.5          49.5          49.5          49.5          49.5          49.5
281          48.5          48.5          48.5          48.5          48.5          48.5          48.5          48.5          48.5
283          50.0          50.0          50.0          50.0          50.0          50.0          50.0          50.0          50.0
287          50.0          50.0          50.0          50.0          50.0          50.0          50.0          50.0          50.0
288          49.0          49.0          49.0          49.0          49.0          49.0          49.0          49.0          49.0
    E1P0495V639T4 E1P0496V605T4 E1P0500V593T4 E1P0522V540T4 E1P0527V606T4 E1P0531V602T4II E1P0535V701T4II E1P0538V672T4 E1P0569V524T4
237          50.0         100.0          50.0          50.0          50.0            50.0            50.0          50.0          50.0
250          49.5          49.5          49.5          49.5          49.5            49.5            49.5          49.5          49.5
281          48.5          48.5          48.5          48.5          48.5            48.5            48.5          99.0          48.5
283          50.0          50.0          50.0          50.0          50.0            50.0            50.0          50.0          50.0
287          50.0          50.0          50.0          50.0          50.0            50.0            50.0          50.0          50.0
288          49.0          49.0          49.0          49.0          49.0            49.0            49.0          49.0          49.0
    E1P0575V601T4 E1P0577V644T4 E1P0590V604T4 E1P0593V650T4 E1P0627V681T4 E1P0634V631T4 E1P0651V613T4 E1P0682V610T4 E1P0683V624T4
237          50.0          50.0          50.0          50.0          50.0          50.0          50.0          50.0          50.0
250          49.5          49.5          49.5          49.5          49.5          99.0          49.5          49.5          49.5
281          48.5          48.5          48.5         100.0          48.5          48.5          48.5          48.5          48.5
283          50.0          50.0          50.0          50.0          50.0          50.0          50.0          50.0          50.0
287          50.0          50.0          50.0          50.0          50.0          50.0          50.0          50.0          50.0
288          49.0          49.0          49.0          49.0          49.0          49.0          49.0          49.0          49.0
    E1P0720V640T4 E1P0751V712T4 E1P0754V720T4 E1P0755V647T4 E1P0789V686T4 E1P0832V655T4 E1P0833V603T4 E1P0851V597T4 E1P0861V709T4II
237          50.0          50.0          50.0          50.0          50.0          50.0          50.0          50.0            50.0
250         100.0          49.5          49.5          49.5          49.5          49.5          49.5          49.5            49.5
281          48.5          48.5          48.5          48.5          48.5          48.5          48.5          48.5            48.5
283          50.0          50.0          50.0          50.0          50.0          50.0          50.0          50.0            50.0
287          50.0          50.0          50.0          50.0          50.0          50.0          50.0          50.0            50.0
288          49.0          49.0          98.0          49.0          49.0          49.0          49.0          49.0            49.0
    E1P0863V688T4 E1P0865V645T4II E1P0946V674T4 E1P0977V581T4 E1P1025V733T4 E1P1040V734T4 E1P1058V658T4 E1P1060V736T4 E1P1084V663T4
237          50.0            50.0          50.0          50.0          50.0          50.0          50.0          50.0          50.0
250          49.5            49.5          49.5          49.5          49.5          49.5          49.5          49.5          49.5
281          48.5            48.5          98.0          48.5          48.5          48.5          48.5          48.5          48.5
283          50.0            50.0          50.0          50.0          50.0          50.0          50.0          50.0          50.0
287          50.0            50.0          50.0          50.0          50.0          50.0          50.0          50.0         100.0
288          49.0            49.0          99.0          49.0          49.0          49.0          49.0          49.0          49.0
    E1P1094V638T4 E1P1103V702T4 E1P1107V590T4 E1P1120V627T4 E1P1133V726T4 E1P1146V580T4 E1P1193V661T4 E1P1199V583T4 E1P1205V569T4PE
237          50.0          50.0          50.0          50.0          50.0          50.0          50.0          50.0            50.0
250          49.5          49.5          49.5          49.5          49.5          49.5          49.5          49.5            49.5
281          48.5          48.5          48.5          48.5          48.5          48.5          48.5          48.5            48.5
283          50.0          50.0          50.0          50.0          50.0          50.0          50.0          50.0            50.0
287          50.0          50.0          50.0          50.0          50.0          50.0          50.0          50.0            50.0
288          49.0          49.0          49.0          49.0          49.0          49.0          49.0          49.0            49.0
    E1P1240V696T4
237          50.0
250          49.5
281          48.5
283          50.0
287          50.0
288          49.0

This is what happens. Is that supposed to happen, or are my rows and columns not right? Rows are OTUs and columns are samples.

Thank you! Tamar
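The ranks printed above are consistent with ranking each OTU across the 100 samples rather than ranking OTUs within each sample: a row with one nonzero sample shows 50.0 for its 99 tied zeros and 100.0 for the nonzero entry. The printout says "taxa are rows", so `apply(abund, 1, rank)` works along each taxon. A minimal base R sketch of the difference, using a toy matrix with hypothetical values (the sample names `S1` to `S3` are made up):

```r
# Toy OTU-style matrix with taxa as rows and samples as columns
# (hypothetical values, not the data from this thread).
abund <- matrix(
  c(0, 918,   0,
    0,   0, 504,
    7,   3,   0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("237", "250", "281"), c("S1", "S2", "S3"))
)

# MARGIN = 1 ranks each taxon across samples. With taxa as rows this is
# NOT a within-sample ranking; zeros in a row are tied with each other.
across_samples <- t(apply(abund, 1, rank))

# MARGIN = 2 ranks the taxa within each sample (column). The result
# already has taxa as rows, so no t() is needed.
within_sample <- apply(abund, 2, rank)

print(across_samples)
print(within_sample)
```

With a phyloseq object, `taxa_are_rows()` reports the orientation, so the margin can be chosen to match: 2 when taxa are rows, 1 when samples are rows.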