modified seqspec/seqspec_index.py
@@ -285,6 +285,17 @@ def format_zumis(indices, subregion_type=None):
     return "\n".join(xl)[:-1]
+def stable_deduplicate_fqs(fqs):
+    # stably deduplicate gdna_fqs
+    seen_fqs = set()
+    deduplicated_fqs = []
+    for r in fqs:
+        if r not in seen_fqs:
+            deduplicated_fqs.append(r)
+            seen_fqs.add(r)
+    return deduplicated_fqs
+
+
 def format_chromap(indices, subregion_type=None):
     bc_fqs = []
     bc_str = []
@@ -306,8 +317,9 @@ def format_chromap(indices, subregion_type=None):
         raise Exception("chromap only supports genomic dna from two fastqs")
     barcode_fq = bc_fqs[0]
-    read1_fq = list(set(gdna_fqs))[0]
-    read2_fq = list(set(gdna_fqs))[1]
+    deduplicated_gdna_fqs = stable_deduplicate_fqs(gdna_fqs)
+    read1_fq = deduplicated_gdna_fqs[0]
+    read2_fq = deduplicated_gdna_fqs[1]
     read_str = ",".join([f"r{idx}:{ele}" for idx, ele in enumerate(gdna_str, 1)])
     bc_str = ",".join(bc_str)
I wrote a notebook to generate a batch of seqspecs, and called run_index() on each one to produce the index arguments, so it would be easier to check that each seqspec generated the expected index values.
However, each time I ran the notebook, some of the chromap filename arguments changed order.
The diffs showing the inconsistent filename ordering are here:
https://github.com/detrout/y2ave_seqspecs/commits/main/all_arguments.tsv
I believe what's happening is that the sets built at https://github.com/pachterlab/seqspec/blob/main/seqspec/seqspec_index.py#L309 aren't guaranteed to iterate in a fixed order.
I think the disorder comes from how the set happens to be laid out in memory (CPython also randomizes string hashes per process unless PYTHONHASHSEED is set), so it may depend on how much else is going on and may not show up unless the function is called in a loop after a lot of parsing. (Worse, because the order is not guaranteed, nothing strictly rules out read1_fq and read2_fq ending up equal if the two separate calls to set() produced different orders.)
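The run-to-run variation described above can be reproduced in isolation. The following is only a sketch: the fastq names are hypothetical, and `first_element_for_seed` is a helper invented for this illustration, not part of seqspec. It starts a fresh interpreter for several `PYTHONHASHSEED` values and prints which filename `list(set(...))[0]` returns in each.

```python
import os
import subprocess
import sys

# Hypothetical fastq names, used only for illustration.
SNIPPET = "print(list(set(['atac_R1.fastq.gz', 'atac_R2.fastq.gz']))[0])"


def first_element_for_seed(seed):
    """Run SNIPPET in a fresh interpreter with a fixed PYTHONHASHSEED."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    result = subprocess.run(
        [sys.executable, "-c", SNIPPET],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


# Seeds 1..8 each fix the string hashes for one interpreter run.
firsts = {seed: first_element_for_seed(seed) for seed in range(1, 9)}
for seed, first in firsts.items():
    print(seed, first)
```

If the printed filename differs between seeds, that is the same effect as the filenames swapping between notebook runs. A fixed seed gives a repeatable order, but relying on that would only hide the problem.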
I wrote a function that removes duplicate fastqs while preserving the order in which they first appear.
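As a quick check of that behaviour, here is a minimal sketch that copies the helper from the patch and runs it on a hypothetical `gdna_fqs` list (the filenames are made up for illustration). It also shows `dict.fromkeys`, a standard-library way to get the same result, since dicts preserve insertion order in Python 3.7 and later; that alternative is a suggestion, not something the patch uses.

```python
def stable_deduplicate_fqs(fqs):
    """Drop repeated entries while keeping first-seen order (as in the patch)."""
    seen_fqs = set()
    deduplicated_fqs = []
    for r in fqs:
        if r not in seen_fqs:
            deduplicated_fqs.append(r)
            seen_fqs.add(r)
    return deduplicated_fqs


# Hypothetical input: each fastq is listed once per region it contains.
gdna_fqs = [
    "atac_R1.fastq.gz",
    "atac_R2.fastq.gz",
    "atac_R1.fastq.gz",
    "atac_R2.fastq.gz",
]

# Deduplicate once and unpack, so read1_fq and read2_fq cannot coincide.
read1_fq, read2_fq = stable_deduplicate_fqs(gdna_fqs)
print(read1_fq, read2_fq)

# Possible alternative: dict keys are unique and keep insertion order.
alternative = list(dict.fromkeys(gdna_fqs))
print(alternative)
```

Deduplicating once and indexing into the single result also removes the dependence on two separate set() calls agreeing with each other.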