shendurelab / MPRAflow

A portable, flexible, parallelized tool for complete processing of massively parallel reporter assay data
Apache License 2.0

Typo affecting arithmetic in line 16 of src/merge_all.py #55

Closed (lucaskampman closed this issue 2 years ago)

lucaskampman commented 2 years ago

Hi,

I think I found a small typo in line 16 of merge_all.py that caused the mpranalyze code to only process replicates 1 through (n-1) in a data set with n replicates.

My .command.sh file reads as follows, with arguments for each of my 3 replicates:

#!/bin/bash -ue
python /wynton/group/corces/user/lkampman/MPRAflow/src/merge_all.py K562 "K562_count.csv" K562_1_counts.csv K562_3_counts.csv K562_2_counts.csv 1 3 2

The following code in merge_all.py drops the last argument, only iterating over sys.argv[3] and sys.argv[4], even though sys.argv[5] should be included:

[...]

replicates=int((len(sys.argv)-3)/2)
for i in range(3,(len(sys.argv)-replicates-1)):
    file=sys.argv[i]
    rep=sys.argv[i+replicates]

    #DNA 1 (condition A, replicate 1)
    colnames=["Barcode", "DNA %s (condition %s, replicate %s)" % (rep,cond,rep),
                         "RNA %s (condition %s, replicate %s)" % (rep,cond,rep)]

[...]

A quick fix was replacing the for loop line above with "for i in range(3,(len(sys.argv)-replicates)):"
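
To make the off-by-one concrete, here is a minimal sketch that replays the same indexing on the argument list from the command above. The `argv` list and the `paired` helper are illustrative stand-ins (assumptions for this example), not code from merge_all.py; only the two `range` bounds are taken from the script and the proposed fix.

```python
# Stand-in for sys.argv as produced by the .command.sh call above
# (script path shortened; this list is an assumption for illustration).
argv = [
    "merge_all.py", "K562", "K562_count.csv",
    "K562_1_counts.csv", "K562_3_counts.csv", "K562_2_counts.csv",
    "1", "3", "2",
]

# Same arithmetic as merge_all.py: (9 - 3) / 2 = 3 replicates.
replicates = int((len(argv) - 3) / 2)


def paired(stop):
    """Hypothetical helper: pair each count file with its replicate label."""
    return [(argv[i], argv[i + replicates]) for i in range(3, stop)]


# Original bound: range(3, 5) visits only indices 3 and 4.
buggy = paired(len(argv) - replicates - 1)

# Proposed bound: range(3, 6) visits indices 3, 4 and 5.
fixed = paired(len(argv) - replicates)

print("buggy:", buggy)
print("fixed:", fixed)
```

With the original bound, the last file/replicate pair (here K562_2_counts.csv with replicate 2) is never visited, because `range` already excludes its stop value and the extra `-1` removes one more index.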

Hope that makes sense — let me know if there's anything I can clarify!

All the best, Lucas

visze commented 2 years ago

What a bummer!

Thanks a lot. Yes, the -1 is wronger than wrong! I didn't notice it because I am using this script (which obviously does not have the error):

import sys
import pandas as pd
import numpy as np
import dask.dataframe as dd

import click

# options
@click.command()
@click.option('--condition',
              required=True, 
              type=str,
              help='Name of the condition.')
@click.option('--counts',
              'counts',
              required=True,
              nargs=2,
              multiple=True,
              type=(str, click.Path(exists=True, readable=True)),
              help='Replicate name and Count file. Can be used multiple times')
@click.option('--output',
              'output_file',
              required=True,
              type=click.Path(writable=True),
              help='Output file.')
def cli(condition, counts, output_file):

    dk_full_df= None

    for replicate_count in counts:

        rep=replicate_count[0]
        file=replicate_count[1]

        #DNA 1 (condition A, replicate 1)
        colnames=["Barcode", "DNA %s (condition %s, replicate %s)" % (rep,condition,rep),
                             "RNA %s (condition %s, replicate %s)" % (rep,condition,rep)]
        cur=pd.DataFrame(pd.read_csv(file, sep='\t', header=None))
        print(cur.head())
        cur.columns=colnames
        cur_dk=dd.from_pandas(cur,npartitions=1)
        print(cur.head())

        if (dk_full_df is not None):

            tmp=dd.merge(dk_full_df,cur_dk, on=["Barcode"],how='outer')
            dk_full_df=tmp
        else:
            dk_full_df=cur_dk

        print(dk_full_df.head())

    dk_full_df=dk_full_df[sorted(dk_full_df.columns)]
    print(dk_full_df.head())

    dk_full_df.compute().to_csv(output_file, index=False)

if __name__ == '__main__':
    cli()

I will update it and create a new release, v2.3.2.

visze commented 2 years ago

Just to frame the error: it appears only when count.nf is used with the option --mpranalyze.