pjweggy / CLADES

9 stars 2 forks source link

SNP Input File #1

Open btmartin721 opened 5 years ago

btmartin721 commented 5 years ago

Hello,

For input files containing only SNPs, do I need to do anything different? Or Do I just put one site per locus as below:

4 1

ind1^pop1    A
ind2^pop1    A
ind3^pop2    T
ind4^pop2    T

Alternatively, I thought about using my whole loci (ddRADseq loci length = ~90-100 bp). But if a locus is missing an individual, it would contain missing data for all sites at that locus for that individual. Would that cause CLADES to crash? E.g.,

2 4

ind1^pop1    AAAA
ind2^pop2    ----

Finally, which method do you think would be better?

Thanks for your time.

SamuelLNT commented 3 months ago

Hi,

I am a PhD student at the University of Oulu working on butterflies. I am now trying to use CLADES for my ddRADseq dataset. My experience so far is a bit mysterious, and I think CLADES has some problems dealing with missing data. I am not a bioinformatician or a programmer, so please correct me if I have done something wrong below. I hope my experience using CLADES for SNPs data might help someone wanting to use CLADEs with theirs.

Background of my data

As a background, I have been using Ipyrad0.9.92 to process my data, so I have tried several ways to format my SNPs into flavourable input for CLADES. So far, the .loci file from the Ipyrad output works the best for me (I tried to use the single line SNPs phylip format as input, e.g. .snps / .phy, but it does not work, See point 1 below). The .loci file is basically a phylip file carrying all the aligned "loci" reconstructed from the dataset, and it corresponds to the %ID column in the VCF file written out by Ipyrad. Since the loci were reconstructed from a ddRADseq dataset, they are different in length, some loci have a lot of missing data (of course depends on the filtering), and some are invariant loci. However, from my experience, CLADEs can only run if all the alignments has at least one variant site because it relies on the "private position" to do the calculation (see point 2 below). Therefore, I have used the python script to split each locus into a separated .phy file. Then I used bcftools to filter out SNPs site with NS>50 (I have 55 samples in total), obtain the locus number passing the filter and reassemble these loci into the "phylip-like" input format for CLADES.

Scenarios / Problems I have encountered, and my solutions / hypothesis:

1. Not able to handle single line SNPs matrix When using single line alignment (e.g. .snps from ipyrad), CLADES will not work because it will return an error forcing you to use at least two alignments in the input_seq.txt. --> I tried to duplicate the alignment and put it in one file, it resolved the problem.

2. Alignments have to contain at least one SNP, heterozygous site might not work(?) I tried to pour all the loci from the ddRADseq for CLADES, it returned error saying some locus are identical. However, during manual inspection, some sites are marked as ambiguity site due to the heterozygosity (e.g. one individual is heterogeneous A/T at that size denoted W, CLADES seems to consider this as invariant site). --> I retained only the variant sites present in the VCF files, and filter the SNPs with NS>2 then it works.

3. Warning potentially related to missing data While I run the dataset from point 2, I have a long list of warning:

WARNING: original #nonzeros ###
       > new      #nonzeros ###
If feature values are non-negative and sparse, use -l 0 rather than the default -l -1

--> if I apply stricter filtering on the loci for input from the VCF file, I have less WARNINGs, therefore, I suspected that it should be related to missing data. I also did some search on this particular error, many people said it is related to something divided by 0.

4. Wrong input format CLADES sometime does not return meaningful error if it has problem reading the input file. It happens to me that I have this

File "CLADES.py", line 367, in <module>
    header=line.split()[0]
IndexError: list index out of range

--> apparently, four empty lines (\n)between each alignment is quite important. (It was coded like that in the CLADES.py) The following format will work (I put the "(\n)" just to emphasis the importance of these four empty lines, you should see nothing in the real input_seq.txt):

55 300
s1^p1 AAAAAAATTAAA......
...
s55^p1  AAAAAATTTAAA...
(\n)
(\n)
(\n)
(\n)
53 254
s1^p1 ------AAAATT....
...
s53^p1 --AATTTCCC...

5. Individual ID can not be too long If your individual ID is too long, CLADES can also crash. It is common in coding.

If you have any idea to avoid the warnings, or have ideas to improve the workflow, please let me know.

Cheers, Sam