How to output FusionCatcher predicted gene fusions in BEDPE format?

SysLuke commented 3 weeks ago

Dear ndaniel,

I am using FusionCatcher for gene fusion prediction and would like to output the predicted results in BEDPE format. Is there any existing command-line option or configuration to directly convert FusionCatcher's output into BEDPE format?

If FusionCatcher does not provide this feature, could you suggest any methods to convert the final-list_candidate-fusion-genes.txt file into BEDPE format?

Thank you for any advice or solutions you can provide.

Best regards, Luke

NadineWolgast commented 1 week ago

Dear Luke,

FusionCatcher does not natively support output in BEDPE format. However, you can convert the final-list_candidate-fusion-genes.txt file into BEDPE format using a Python script. Below is a method to achieve this:

Script: `convert_to_bedpe.py`

This script reads the final-list_candidate-fusion-genes.txt file produced by FusionCatcher and converts it into BEDPE format. It extracts key columns such as chromosome, start/end positions, strand, and fusion gene names to construct the BEDPE file.

#!/usr/bin/env python3

import csv
import sys

def convert_to_bedpe(input_file, output_file):
    """
    Converts FusionCatcher's output into BEDPE format.

    :param input_file: Path to FusionCatcher's `final-list_candidate-fusion-genes.txt`.
    :param output_file: Path to the output BEDPE file.
    """
    # newline='' lets the csv module handle line endings itself
    with open(input_file, 'r', newline='') as infile, open(output_file, 'w', newline='') as outfile:
        reader = csv.reader(infile, delimiter='\t')
        writer = csv.writer(outfile, delimiter='\t')

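        # Write BEDPE header; the leading '#' marks it as a comment line for tools such as bedtools
        writer.writerow(["#chrom1", "start1", "end1", "chrom2", "start2", "end2", "name", "score", "strand1", "strand2"])

        # Skip header line of FusionCatcher file (no error if the file is empty)
        next(reader, None)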

        for line in reader:
            # Skip invalid rows
            if len(line) < 10:
                print(f"Skipping invalid row: {line}")
                continue

            try:
                # Extract relevant data
                gene1 = line[0]
                gene2 = line[1]
                chrom1, pos1_str, strand1 = line[8].split(':')  # Fusion_point_for_gene_1
                chrom2, pos2_str, strand2 = line[9].split(':')  # Fusion_point_for_gene_2
                pos1 = int(pos1_str)
                pos2 = int(pos2_str)

                # Define BEDPE fields
                start1 = pos1 - 1
                end1 = pos1
                start2 = pos2 - 1
                end2 = pos2
                fusion_name = f"{gene1}-{gene2}"
                score = 0  # Placeholder for score

                # Write the BEDPE line
                writer.writerow([chrom1, start1, end1, chrom2, start2, end2, fusion_name, score, strand1, strand2])

            except ValueError as e:
                print(f"Skipping row due to error: {e}")
                continue

# Main execution
if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: python3 convert_to_bedpe.py <input_file> <output_file>")
        sys.exit(1)

    input_file = sys.argv[1]
    output_file = sys.argv[2]
    convert_to_bedpe(input_file, output_file)
    print(f"Conversion complete. BEDPE file saved to {output_file}")

Best, Nadine

SysLuke commented 5 days ago

Dear Ndaniel,

Thank you for your detailed response regarding the conversion of FusionCatcher's output into BEDPE format. The Python script approach works effectively, and I appreciate your guidance on this matter.

I have encountered an issue while running FusionCatcher, specifically at step 211 involving the find_homolog_genes.py script. Below are the command and error details for your reference:

find_homolog_genes.py \
    --input /home/luke/test-result/P6T_FRRL210062467-1a/reads_filtered_all-possible-mappings-transcriptome_multiple_sorted.map \
    --reads 1 \
    --input_exons /home/luke/biosoft/fusioncatcher/data/current/exons.txt \
    --filter /home/luke/biosoft/fusioncatcher/data/current/custom_genes_mark.txt \
    --processes 16 \
    --output /home/luke/test-result/P6T_FRRL210062467-1a/list_candidates_ambiguous_homologous_genes_2.txt

Error: The process is killed before completion, and the size of the output file is 0 bytes.

The input file /home/luke/test-result/P6T_FRRL210062467-1a/reads_filtered_all-possible-mappings-transcriptome_multiple_sorted.map is quite large (approximately 3.3 GB). I suspect that this might be related to system resource limitations (e.g., memory or CPU constraints). I've tried reducing the number of processes to mitigate this but still faced issues.

Could you suggest any alternative approaches or adjustments that might help resolve this problem? Additionally, is there a way to restart the workflow from step 211 without rerunning the previous steps?

Thank you again for your invaluable support.

Best regards, Luke

NadineWolgast commented 5 days ago

Dear Luke,

I'm not Ndaniel, just a FusionCatcher user ;) You can restart the workflow with the --start=START_STEP parameter, so in your case --start=211. As for the error, I would suggest monitoring your system usage during the run (using htop or top) to confirm whether memory or CPU is the bottleneck. If memory is the problem, consider temporarily increasing swap space or further limiting the number of simultaneous processes. If that doesn't help, you could use a tool like split, or a custom Python script, to divide the large input file into smaller chunks, process them sequentially, and merge the outputs afterward.
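The chunking idea could look roughly like the sketch below. It is only an illustration: the function name `split_by_lines`, the chunk file names, and the default chunk size are made up here, and the thread does not confirm that find_homolog_genes.py produces equivalent results when run on pieces of the .map file (for example, lines belonging to the same read may need to stay in the same chunk, and counts from the per-chunk outputs would have to be combined when merging).

```python
import os
import tempfile


def split_by_lines(input_path, output_dir, lines_per_chunk=1_000_000):
    """Split a text file into numbered chunks of at most lines_per_chunk lines.

    Hypothetical helper: the file is streamed line by line, so memory use
    stays small regardless of the input size. Returns the chunk paths in order.
    """
    os.makedirs(output_dir, exist_ok=True)
    chunk_paths = []
    out = None
    with open(input_path, "r") as infile:
        for i, line in enumerate(infile):
            # Start a new chunk file every lines_per_chunk lines
            if i % lines_per_chunk == 0:
                if out is not None:
                    out.close()
                path = os.path.join(output_dir, f"chunk_{len(chunk_paths):04d}.map")
                chunk_paths.append(path)
                out = open(path, "w")
            out.write(line)
    if out is not None:
        out.close()
    return chunk_paths


if __name__ == "__main__":
    # Tiny self-contained demo on a throwaway file
    workdir = tempfile.mkdtemp()
    demo_input = os.path.join(workdir, "demo.map")
    with open(demo_input, "w") as fh:
        fh.writelines(f"read{i}\tgeneA\n" for i in range(5))
    for chunk in split_by_lines(demo_input, os.path.join(workdir, "chunks"), lines_per_chunk=2):
        print(chunk)
```

Each chunk would then be passed to the failing step via its --input option, one after another, so that only one chunk is held in memory at a time.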

Best, Nadine

SysLuke commented 5 days ago

Dear Nadine,

Thank you for your helpful suggestions regarding the FusionCatcher workflow. I appreciate your advice on monitoring system usage and considering solutions like increasing swap space or splitting the input file.

Regarding the swap space, I previously ran the process successfully using 16 cores, but I suspect I might have deleted the swap space settings in WSL, which is causing interruptions at step 211. Could you advise on an optimal swap space size? Additionally, any tips on reconfiguring swap space in WSL would be greatly appreciated.

Thank you once again for your assistance!

Best regards, Luke

NadineWolgast commented 5 days ago

Dear Luke,

Thank you for your message. I'm glad to hear that my suggestions were helpful! While I don't have direct experience with WSL, I can guide you on reconfiguring swap space within WSL based on general principles and available documentation.

As a general rule of thumb (https://help.ubuntu.com/community/SwapFaq), the swap size should be equal to the size of your RAM for systems with ≤ 8 GB of RAM, and half the size of your RAM for systems with > 8 GB of RAM (e.g., 8 GB swap for 16 GB RAM).
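Written out as a small function, the rule of thumb above looks like this. The function name is made up for illustration, and the result is only a starting point, not a guarantee that step 211 will fit in memory.

```python
def rule_of_thumb_swap_gb(ram_gb):
    """Suggested swap size in GB for a machine with ram_gb of RAM.

    Encodes the rule quoted above: swap equals RAM up to 8 GB,
    and half of RAM above that.
    """
    if ram_gb <= 0:
        raise ValueError("ram_gb must be positive")
    return ram_gb if ram_gb <= 8 else ram_gb / 2


if __name__ == "__main__":
    for ram in (4, 8, 16, 32):
        print(f"{ram} GB RAM -> {rule_of_thumb_swap_gb(ram)} GB swap")
```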

For your current workload, given that the input file is 3.3 GB and you're running processes with 16 cores, I recommend a swap space of 8-16 GB.

Here's how you can recreate and configure swap space in WSL (based on https://learn.microsoft.com/en-us/windows/wsl/wsl-config):

The configuration file for WSL is located at C:\Users\<UserName>\.wslconfig (that is, %UserProfile%\.wslconfig). If it doesn't exist, create it using Notepad or another text editor.

Add the following lines to the .wslconfig file, adjusting the swap size as needed (backslashes in the path must be escaped):

[wsl2]
memory=8GB
processors=16
swap=16GB
swapFile=C:\\Users\\<UserName>\\swap.vhdx

After saving the changes, restart WSL for the new configuration to take effect. Run the following commands in PowerShell:

wsl --shutdown
wsl

Inside WSL, verify the swap space configuration by running:

free -h

This should display the total memory and swap space available to WSL.
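If you would rather log these numbers from a script while step 211 is running, the same information can be read from /proc/meminfo. The following is a minimal sketch; the helper names `parse_meminfo` and `summarize` are invented for this example, and it assumes the usual Linux layout of one "Name: value kB" entry per line.

```python
import os


def parse_meminfo(text):
    """Return {field: value} from /proc/meminfo-style text (values in kB)."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def summarize(text):
    """Report total/available memory and total/free swap in GiB."""
    info = parse_meminfo(text)
    kib_per_gib = 1024 * 1024
    fields = ("MemTotal", "MemAvailable", "SwapTotal", "SwapFree")
    return {f: round(info.get(f, 0) / kib_per_gib, 1) for f in fields}


if __name__ == "__main__":
    # /proc/meminfo only exists on Linux (including WSL)
    if os.path.exists("/proc/meminfo"):
        with open("/proc/meminfo") as fh:
            print(summarize(fh.read()))
```

A SwapTotal of 0 here would indicate that the .wslconfig settings were not picked up.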

I hope these steps point you in the right direction for resolving the issue and configuring swap space in WSL. Unfortunately, as I’m not deeply familiar with it, I may not be able to assist further, but I encourage you to explore official documentation or reach out to experts in this area for more detailed guidance.

Best regards, Nadine

SysLuke commented 4 days ago

Dear Nadine,

Thank you for your previous assistance. I followed your advice to adjust the swap space settings, but I have now been stuck at step 211 for 12 hours. The system monitoring shows high memory and CPU usage, but the process isn't progressing.

Could you offer further advice? Are there additional methods to optimize performance or resolve this issue?

Thank you very much for your support!

Best regards, Luke

ndaniel / fusioncatcher