philliplab / ViralHaplotyper


Think how to handle naming of sequences in get_unique_sequences #32

Open philliplab opened 9 years ago

philliplab commented 9 years ago

When the unique sequences of a haplotype are retrieved, how should they be named?

Currently the name is chosen at random from all the different identical sequences.

A better naming convention will probably be: haplotype_name_seq_id_freq

Where:

I am not so sure about adding the frequency data to the name of the sequence. It feels inappropriate. We need to examine the downstream use cases and think carefully about how to handle this.

ColinAnthony commented 9 years ago

I think the following naming convention is best (we can ask for a user input at the first step, to be used in the labeling):

PatientID_visitID_time_Gene/region_uniqueID_count_freq, e.g. CAP177_2000_004wpi_v1v3_x_2000_23

(where x = a sequential ID to ensure that all sequences have a unique label, which is essential. If it is not too time-consuming, this numbering can be incremented from 1, in order from most frequent to least frequent haplotype, with haplotypes that have identical frequencies ordered at random.)

We have to have the patient ID and time point ID in the label. Adding a 'haplotype' label or the first n letters of the sequence is not necessary, as this will not add any useful information.
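
The numbering scheme described above can be sketched roughly as follows. This is only an illustration in Python, not code from the package: the function name `label_unique_sequences` and its parameters are hypothetical, and `freq` is assumed to be the percentage of all reads, rounded to a whole number (the thread does not define it).

```python
import random
from collections import Counter


def label_unique_sequences(sequences, patient_id, visit_id, time_point,
                           region, seed=None):
    """Return (label, sequence) pairs, one per unique sequence.

    Hypothetical helper: labels follow
    PatientID_visitID_time_region_uniqueID_count_freq.
    """
    counts = Counter(sequences)
    total = sum(counts.values())

    items = list(counts.items())
    # Shuffle first so that sequences with equal counts end up in random order.
    random.Random(seed).shuffle(items)
    # Python's sort is stable, so ties keep their shuffled order.
    items.sort(key=lambda item: item[1], reverse=True)

    labelled = []
    for unique_id, (seq, count) in enumerate(items, start=1):
        # Assumption: freq is a rounded percentage of all reads.
        freq = round(100 * count / total)
        label = "_".join([patient_id, visit_id, time_point, region,
                          str(unique_id), str(count), str(freq)])
        labelled.append((label, seq))
    return labelled


reads = ["ACGT", "ACGT", "ACGT", "ACGA"]
for label, seq in label_unique_sequences(reads, "CAP177", "2000",
                                         "004wpi", "v1v3"):
    print(label, seq)
```

Passing a fixed `seed` makes the random ordering of ties reproducible between runs, which may matter if the labels are compared across outputs.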


philliplab commented 9 years ago

I am just getting uncomfortable with the amount of data that is getting transmitted through the label. It feels to me like something like this:

Then there is a very clear and easy-to-use way in which data is transferred - you don't have to parse the sequence label every time. I will probably move to structures like this internally and just retain the option of outputting files with complex composite names like that. But I need to do some more thinking and look at how this data will get used in the next analysis steps.
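
As a rough illustration of the idea of keeping the metadata in a structure and building the composite name only on output, here is a minimal Python sketch. The class `UniqueSequence`, its field names, and the methods `composite_label` and `to_fasta` are all hypothetical and are not taken from the package; the fields simply mirror the components of the naming convention proposed earlier in the thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UniqueSequence:
    """Hypothetical record: metadata lives in fields, not in the label."""
    patient_id: str
    visit_id: str
    time_point: str
    region: str
    unique_id: int
    count: int
    freq: float
    sequence: str

    def composite_label(self) -> str:
        # Build the composite name only when a flat label is required.
        parts = [self.patient_id, self.visit_id, self.time_point,
                 self.region, str(self.unique_id), str(self.count),
                 f"{self.freq:g}"]
        return "_".join(parts)

    def to_fasta(self) -> str:
        # One FASTA record: header line followed by the sequence.
        return f">{self.composite_label()}\n{self.sequence}\n"


record = UniqueSequence("CAP177", "2000", "004wpi", "v1v3",
                        unique_id=1, count=2000, freq=23.0,
                        sequence="ACGT")
print(record.count)        # direct field access, no label parsing
print(record.to_fasta())
```

With this kind of layout, downstream steps read `record.count` or `record.freq` directly, and the long composite name exists only in exported files.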