tseemann / snp-dists

Pairwise SNP distance matrix from a FASTA sequence alignment
GNU General Public License v3.0

Reduce molten output to unique pairs #47

Open amilesj opened 3 years ago

amilesj commented 3 years ago

This is a new request for the enhancement suggested in a previous comment (https://github.com/tseemann/snp-dists/issues/39#issuecomment-654909438): make the molten output list each unique pair of isolates only once.

idolawoye commented 1 year ago

Hi @amilesj, I have the same issue. Were you able to find a way to get only unique pair combinations in the molten output?

slbai01 commented 1 year ago

I wrote a Python script for this; maybe you can try it.

import argparse
import pandas as pd

def process_molten_file(molten_file, output_file):
    # Read the molten file into a DataFrame
    df = pd.read_csv(molten_file, sep="\t", header=None)
    df.columns = ["Sample", "Pair", "Value"]
    # Build an order-independent key so that (A, B) and (B, A) count as the same pair
    df['Pair2'] = df.apply(lambda row: tuple(sorted([row['Sample'], row['Pair']])), axis=1)

    # Sort the DataFrame based on 'Value' column in descending order
    df = df.sort_values(by='Value', ascending=False)

    # Drop duplicates based on the 'Pair2' key; after the sort above, the first row kept is the one with the maximum 'Value'
    unique_pairs_df = df.drop_duplicates(subset='Pair2', keep='first')

    # If you want to reset the index of the resulting DataFrame:
    unique_pairs_df = unique_pairs_df.reset_index(drop=True)

    # Save the output DataFrame to a TSV file, leaving out the helper 'Pair2' column
    unique_pairs_df.drop(columns='Pair2').to_csv(output_file, sep='\t', index=False)
    print(f"Contents of the reduced molten file are saved to: {output_file}")

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Reduce molten output to unique pairs and keep max value.")
    parser.add_argument("molten_file", help="Path to the molten output file to be processed")
    parser.add_argument("output_file", help="Path to save the output in TSV format")

    args = parser.parse_args()
    molten_file_path = args.molten_file
    output_file_path = args.output_file

    process_molten_file(molten_file_path, output_file_path)
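
If pandas is not available, the same reduction can be sketched with only the standard library. This is a rough alternative, not part of snp-dists. It assumes the molten file is a header-less, three-column TSV (sequence 1, sequence 2, distance) and that the distances are symmetric, so either orientation of a pair can be kept. The function names `unique_pairs` and `reduce_molten` are made up for this example. Unlike the script above, it also skips self-comparisons (A vs A).

```python
import csv
import io


def unique_pairs(rows):
    """Yield one (a, b, dist) row per unordered pair, skipping self-pairs."""
    seen = set()
    for row in rows:
        if len(row) != 3:
            continue  # ignore blank or malformed lines
        a, b, dist = row
        if a == b:
            continue  # self-comparison
        # Order-independent key: (A, B) and (B, A) map to the same tuple
        key = (a, b) if a < b else (b, a)
        if key not in seen:
            seen.add(key)
            yield a, b, dist


def reduce_molten(src, dst):
    """Read molten rows from file object src, write unique pairs to dst."""
    reader = csv.reader(src, delimiter="\t")
    writer = csv.writer(dst, delimiter="\t", lineterminator="\n")
    writer.writerows(unique_pairs(reader))


# Small in-memory demonstration with two isolates
demo_in = io.StringIO("A\tA\t0\nA\tB\t3\nB\tA\t3\nB\tB\t0\n")
demo_out = io.StringIO()
reduce_molten(demo_in, demo_out)
print(demo_out.getvalue(), end="")
```

To run it on real files, pass open file handles instead of the `io.StringIO` objects, for example `reduce_molten(open("molten.tsv", newline=""), open("unique.tsv", "w", newline=""))`. The first orientation seen for each pair is the one that is kept, and the `seen` set grows with the number of unique pairs.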