pachterlab / seqspec

machine-readable file format for genomic library sequence and structure
MIT License
109 stars 17 forks source link

format chromap potentially is not deterministic as python set() does not have a guaranteed order, #36

Closed detrout closed 3 months ago

detrout commented 4 months ago

I had written a notebook to generate a bunch of seqspecs and I had decided to call run_index() to generate the index arguments to make it easier to review that the seqspec was generating the expected index values.

However, each time I ran the notebook, some of the chromap filename arguments moved around some

The diffs showing the inconsistent filename ordering is here:

https://github.com/detrout/y2ave_seqspecs/commits/main/all_arguments.tsv

I believe what's happening is that the order of the sets at https://github.com/pachterlab/seqspec/blob/main/seqspec/seqspec_index.py#L309 isn't guaranteed to have a fixed order,

    read1_fq = list(set(gdna_fqs))[0]
    read2_fq = list(set(gdna_fqs))[1]

however I think it's disorder comes from memory allocation so may depends some on how much management is going on, so may not show up without having called the function in a loop after doing a bunch of parsing. (Worse there's a slim chance read1_fq and read2_fq could end up equal if the two different calls to set() generated different orders.

I wrote a function that is guaranteed to maintain the order of the fastqs and remove duplicates

modified   seqspec/seqspec_index.py
@@ -285,6 +285,17 @@ def format_zumis(indices, subregion_type=None):
     return "\n".join(xl)[:-1]

+def stable_deduplicate_fqs(fqs):
+    # stably deduplicate gdna_fqs
+    seen_fqs = set()
+    deduplicated_fqs = []
+    for r in fqs:
+        if r not in seen_fqs:
+            deduplicated_fqs.append(r)
+            seen_fqs.add(r)
+    return deduplicated_fqs
+
+
 def format_chromap(indices, subregion_type=None):
     bc_fqs = []
     bc_str = []
@@ -306,8 +317,9 @@ def format_chromap(indices, subregion_type=None):
         raise Exception("chromap only supports genomic dna from two fastqs")

     barcode_fq = bc_fqs[0]
-    read1_fq = list(set(gdna_fqs))[0]
-    read2_fq = list(set(gdna_fqs))[1]
+    deduplicated_gdna_fqs = stable_deduplicate_fqs(gdna_fqs)
+    read1_fq = deduplicated_gdna_fqs[0]
+    read2_fq = deduplicated_gdna_fqs[1]
     read_str = ",".join([f"r{idx}:{ele}" for idx, ele in enumerate(gdna_str, 1)])
     bc_str = ",".join(bc_str)
sbooeshaghi commented 3 months ago

Fixed in recent PR to devel.