Think how to handle naming of sequences in get_unique_sequences

philliplab / ViralHaplotyper


Think how to handle naming of sequences in get_unique_sequences #32

Open philliplab opened 10 years ago

philliplab commented 10 years ago

When the unique sequences of a haplotype are retrieved, how should they be named?

Currently the name is chosen at random from all the different identical sequences.

A better naming convention will probably be: haplotype_name_seq_id_freq

Where:

haplotype_name is assigned when the haplotype is constructed (currently the first n letters of a random sequence in the haplotype)
seq_id is just a sequential identifier
freq is the number of copies of this sequence
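
As a rough illustration of the haplotype_name_seq_id_freq convention, here is a minimal Python sketch. It is not ViralHaplotyper code: the function name `name_unique_sequences` is hypothetical, and numbering the sequences in order of first appearance is an assumption, since the proposal only says the identifier is sequential.

```python
from collections import Counter


def name_unique_sequences(sequences, haplotype_name):
    """Return {label: sequence} for the distinct sequences of one haplotype.

    Labels follow haplotype_name_seq_id_freq. seq_id counts up from 1 in
    order of first appearance (an assumption, not part of the proposal).
    """
    # Counter keeps keys in first-insertion order (Python 3.7+).
    counts = Counter(sequences)
    return {
        f"{haplotype_name}_{seq_id}_{freq}": seq
        for seq_id, (seq, freq) in enumerate(counts.items(), start=1)
    }


reads = ["ACGT", "ACGT", "ACGA", "ACGT"]
named = name_unique_sequences(reads, "hap1")
# named == {"hap1_1_3": "ACGT", "hap1_2_1": "ACGA"}
```

Note that the copy count ends up as the last underscore-separated field, so any downstream step that needs it has to parse the label.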

I am not so sure about adding the frequency data to the name of the sequence. It feels inappropriate. We need to examine the downstream use cases and think carefully about how to handle this.

ColinAnthony commented 10 years ago

I think the following naming convention is best (we can specify a user input at the first step, to be used in the labeling):

PatientID_visitID_time_Gene/region_uniqueID_count_freq, e.g. CAP177_2000_004wpi_v1v3_x_2000_23

(where x is a sequential ID that ensures every sequence has a unique label, which is essential. If it is not too time consuming, this numbering can be incremented from 1, in order from most frequent to least frequent haplotype, with haplotypes of identical frequency ordered at random.)
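
The numbering scheme described here could be sketched roughly as follows in Python. The function name `composite_labels` is hypothetical, and two points are assumptions rather than anything stated in the thread: count is taken to be the number of copies, and freq is taken to be the percentage of all reads, rounded to a whole number. Ties are kept in input order here instead of being shuffled, so that repeated runs give the same labels.

```python
from collections import Counter


def composite_labels(sequences, patient_id, visit_id, time_point, region):
    """Label distinct sequences as
    PatientID_visitID_time_region_uniqueID_count_freq,
    numbering from 1 in order of decreasing copy number."""
    counts = Counter(sequences)
    total = sum(counts.values())
    prefix = f"{patient_id}_{visit_id}_{time_point}_{region}"
    labelled = []
    # most_common() sorts by count, highest first; equal counts keep the
    # order in which they were first seen (not random, by choice).
    for unique_id, (seq, count) in enumerate(counts.most_common(), start=1):
        freq = round(100 * count / total)  # assumed meaning of "freq"
        labelled.append((f"{prefix}_{unique_id}_{count}_{freq}", seq))
    return labelled


reads = ["ACGA"] + ["ACGT"] * 3
labels = composite_labels(reads, "CAP177", "2000", "004wpi", "v1v3")
# labels[0] == ("CAP177_2000_004wpi_v1v3_1_3_75", "ACGT")
# labels[1] == ("CAP177_2000_004wpi_v1v3_2_1_25", "ACGA")
```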

We have to have the patient ID and time point ID in the label. Adding a 'haplotype' label or the first n letters of the sequence is not necessary, as this will not add any useful information.

philliplab commented 10 years ago

I am just getting uncomfortable with the amount of data that is getting transmitted through the label. It feels to me like something like this:

A label consisting of seq_x, where x is an integer that uniquely identifies the sequence,
together with a csv file containing columns like seq_id, pat_id, time_point, ... is a much better design.

Then there is a very clear and easy-to-use way in which data is transferred: you don't have to parse the sequence label every time. I will probably move to structures like this internally and then just retain the option of outputting files with complex composite names like that. But I need to do some more thinking and look at how this data will get used in the next analysis steps.
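
A minimal Python sketch of this label-plus-table design, under stated assumptions: all function names are hypothetical, the `count` column is an illustrative addition (the list above only names seq_id, pat_id and time_point), and the field order of the composite name is likewise just an example.

```python
import csv
import io
from collections import Counter

FIELDS = ["seq_id", "pat_id", "time_point", "count"]  # "count" is assumed


def split_labels_and_metadata(sequences, pat_id, time_point):
    """Return (records, rows): records are (seq_x, sequence) pairs with
    short labels, rows are the matching metadata dicts."""
    records, rows = [], []
    for x, (seq, count) in enumerate(Counter(sequences).most_common(), start=1):
        seq_id = f"seq_{x}"
        records.append((seq_id, seq))
        rows.append({"seq_id": seq_id, "pat_id": pat_id,
                     "time_point": time_point, "count": count})
    return records, rows


def metadata_csv(rows):
    """Serialise the metadata rows as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def composite_name(row):
    """Optional export: rebuild a long composite label from one row."""
    return f"{row['pat_id']}_{row['time_point']}_{row['seq_id']}_{row['count']}"


records, rows = split_labels_and_metadata(["ACGT", "ACGA", "ACGT"],
                                          "CAP177", "004wpi")
table = metadata_csv(rows)
```

With this split, later analysis steps join on seq_id rather than parsing labels, and the composite name is only built when a file with long labels is actually requested.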