iTaxoTools / TaxI2-legacy

Calculates genetic differences between DNA sequences
GNU General Public License v3.0

Implement "spart" output for clustering results #33

mvences closed this issue 3 years ago

mvences commented 3 years ago

"spart" stands for "species partition" and is a standardized format that we are proposing for reporting the results of grouping samples (specimens of organisms, usually represented by DNA sequences) into subsets (often species). For the clustering results, the file should look as follows. The spart output should be a separate text file named taxi2_cluster_DATEANDTIME.spart, where DATEANDTIME is a timestamp giving the system date and time at which the clustering was completed.

The spart file itself should be as in the attached PDF, where the text in red needs to be adapted according to the results of the clustering. Note that each "command block" always needs to end with a semicolon. The timestamp inside the file needs to be in ISO format but can be less precise than in the example; for instance, "2020-09-21T07:26:10" would be acceptable.
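A minimal Python sketch of how the two timestamps could be produced. The function name `spart_timestamps` is hypothetical, and the compact form used in the file name is an assumption (the issue only says DATEANDTIME); it avoids colons because they are not valid in Windows file names. The in-file date follows the ISO example given above.

```python
from datetime import datetime


def spart_timestamps(finished: datetime) -> tuple[str, str]:
    """Return (file_name, iso_date) for a clustering run finished at `finished`."""
    # In-file timestamp: ISO 8601 to the second, like the example in the issue.
    iso_date = finished.replace(microsecond=0).isoformat()
    # File-name stamp: assumed compact form without colons (hypothetical choice).
    file_stamp = finished.strftime("%Y%m%dT%H%M%S")
    return f"taxi2_cluster_{file_stamp}.spart", iso_date


file_name, iso_date = spart_timestamps(datetime(2020, 9, 21, 7, 26, 10))
print(file_name)  # taxi2_cluster_20200921T072610.spart
print(iso_date)   # 2020-09-21T07:26:10
```

In the program itself, `datetime.now()` taken right after clustering finishes would be passed instead of a fixed date.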

Under "N_spartitions" you give the number of clusters that resulted from the analysis. The name of this "species partition" will be the name of the original input file, with all non-standard characters replaced by underscores (only the characters 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz are allowed).
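The character replacement could be done with a single regular expression. This is only a sketch; `sanitize_name` is a hypothetical helper name, and whether the file extension should be stripped before sanitizing is not stated in the issue, so the example keeps it.

```python
import re

# Anything outside 0-9, A-Z, a-z and underscore is replaced.
_DISALLOWED = re.compile(r"[^0-9A-Za-z_]")


def sanitize_name(name: str) -> str:
    """Replace every character that is not allowed in spart names by '_'."""
    return _DISALLOWED.sub("_", name)


print(sanitize_name("my input-file.fasta"))  # my_input_file_fasta
```

The same helper would apply to the individual names listed under "Individual_assignment".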

Under N_individuals the spart file reports the number of samples (sequences) that went into the analysis.

Under N_subsets the spart file reports the number of clusters that were found in the analysis.

All information in brackets is just textual information for the user and will be ignored by spart parsers.

Finally, under "Individual_assignment" the spart file reports each individual (sequence) in the analysis on its own line, together with the cluster to which it has been assigned. Note the semicolon after the last line. Here, too, the individual names should be adapted so that they consist only of the allowed characters 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz.
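As a rough illustration, the counts and the assignment block could be generated from a mapping of individual name to cluster number. This is a sketch under assumptions: the function name `assignment_block` is hypothetical, and the exact line layout (`name : cluster`, `key = value;`, semicolon appended to the last line) is assumed here; the attached PDF remains the reference for the real layout.

```python
import re


def assignment_block(clusters: dict[str, int]) -> str:
    """Build N_individuals, N_subsets and Individual_assignment from {name: cluster}."""
    # Sanitize names; a list of pairs keeps every individual even if two
    # names become identical after replacement.
    clean = [(re.sub(r"[^0-9A-Za-z_]", "_", name), c) for name, c in clusters.items()]
    lines = [
        f"N_individuals = {len(clean)};",
        f"N_subsets = {len({c for _, c in clean})};",
        "Individual_assignment =",
    ]
    # One line per individual (assumed layout: "name : cluster").
    lines += [f"{name} : {c}" for name, c in clean]
    lines[-1] += ";"  # the block is closed by a semicolon after the last line
    return "\n".join(lines)


print(assignment_block({"sample 1": 1, "sample-2": 1, "sample.3": 2}))
# N_individuals = 3;
# N_subsets = 2;
# Individual_assignment =
# sample_1 : 1
# sample_2 : 1
# sample_3 : 2;
```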

spart_cluster_example.pdf