Open SysLuke opened 3 weeks ago
Dear Luke,
FusionCatcher does not natively support output in BEDPE format. However, you can convert the final-list_candidate-fusion-genes.txt file into BEDPE format using a Python script. Below is a method to achieve this:
convert_to_bedpe.py
This script reads the final-list_candidate-fusion-genes.txt file produced by FusionCatcher and converts it into BEDPE format. It extracts each partner's chromosome, breakpoint position, and strand, plus the two gene names, and writes each breakpoint as a 1 bp interval.
#!/usr/bin/env python3
import csv
import sys


def convert_to_bedpe(input_file, output_file):
    """
    Converts FusionCatcher's output into BEDPE format.

    :param input_file: Path to FusionCatcher's `final-list_candidate-fusion-genes.txt`.
    :param output_file: Path to the output BEDPE file.
    """
    with open(input_file, 'r', newline='') as infile, \
            open(output_file, 'w', newline='') as outfile:
        reader = csv.reader(infile, delimiter='\t')
        # Use '\n' line endings; the csv default is '\r\n'
        writer = csv.writer(outfile, delimiter='\t', lineterminator='\n')

        # Write BEDPE header; the leading '#' marks it as a comment line
        writer.writerow(["#chrom1", "start1", "end1", "chrom2", "start2", "end2",
                         "name", "score", "strand1", "strand2"])

        # Skip header line of FusionCatcher file
        next(reader, None)

        for line in reader:
            # Skip rows too short to contain both fusion points
            if len(line) < 10:
                print(f"Skipping invalid row: {line}", file=sys.stderr)
                continue
            try:
                # Extract relevant data
                gene1 = line[0]
                gene2 = line[1]
                chrom1, pos1_str, strand1 = line[8].split(':')  # Fusion_point_for_gene_1
                chrom2, pos2_str, strand2 = line[9].split(':')  # Fusion_point_for_gene_2
                pos1 = int(pos1_str)
                pos2 = int(pos2_str)

                # BEDPE intervals are 0-based and half-open, so a 1-based
                # breakpoint at pos becomes [pos - 1, pos)
                start1 = pos1 - 1
                end1 = pos1
                start2 = pos2 - 1
                end2 = pos2
                fusion_name = f"{gene1}-{gene2}"
                score = 0  # Placeholder for score

                # Write the BEDPE line
                writer.writerow([chrom1, start1, end1, chrom2, start2, end2,
                                 fusion_name, score, strand1, strand2])
            except ValueError as e:
                # Raised when a fusion point is not of the form chrom:pos:strand
                print(f"Skipping row due to error: {e}", file=sys.stderr)
                continue


# Main execution
if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: python3 convert_to_bedpe.py <input_file> <output_file>")
        sys.exit(1)
    input_file = sys.argv[1]
    output_file = sys.argv[2]
    convert_to_bedpe(input_file, output_file)
    print(f"Conversion complete. BEDPE file saved to {output_file}")
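To make the coordinate handling concrete, here is a minimal sketch of how a single fusion point string becomes a BEDPE interval. The helper name fusion_point_to_interval and the sample breakpoint are made up for illustration; the chrom:position:strand layout is what the script above assumes for the two fusion point columns.

```python
def fusion_point_to_interval(fusion_point):
    """Turn 'chrom:position:strand' (1-based position) into a
    0-based, half-open interval (chrom, start, end, strand)."""
    chrom, pos_str, strand = fusion_point.split(":")
    pos = int(pos_str)
    return chrom, pos - 1, pos, strand


# Hypothetical breakpoint, not taken from a real FusionCatcher run
example = fusion_point_to_interval("17:39723967:+")
print(example)  # ('17', 39723966, 39723967, '+')
```

The interval is always exactly 1 bp wide, which is why start and end differ by one in every output row.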
Best, Nadine
Dear Ndaniel,
Thank you for your detailed response regarding the conversion of FusionCatcher's output into BEDPE format. The Python script approach works effectively, and I appreciate your guidance on this matter.
I have encountered an issue while running FusionCatcher, specifically at step 211 involving the find_homolog_genes.py script. Below is the command and error details for your reference:
find_homolog_genes.py \
--input /home/luke/test-result/P6T_FRRL210062467-1a/reads_filtered_all-possible-mappings-transcriptome_multiple_sorted.map \
--reads 1 \
--input_exons /home/luke/biosoft/fusioncatcher/data/current/exons.txt \
--filter /home/luke/biosoft/fusioncatcher/data/current/custom_genes_mark.txt \
--processes 16 \
--output /home/luke/test-result/P6T_FRRL210062467-1a/list_candidates_ambiguous_homologous_genes_2.txt
Error:
The process is killed before completion, and the size of the output file is 0 bytes.
The input file /home/luke/test-result/P6T_FRRL210062467-1a/reads_filtered_all-possible-mappings-transcriptome_multiple_sorted.map is quite large (approximately 3.3 GB).
I suspect that this might be related to system resource limitations (e.g., memory or CPU constraints). I’ve tried reducing the number of processes to mitigate this but still faced issues.
Could you suggest any alternative approaches or adjustments that might help resolve this problem? Additionally, is there a way to restart the workflow from step 211 without rerunning the previous steps?
Thank you again for your invaluable support.
Best regards, Luke
Dear Luke,
I'm not Ndaniel, just a FusionCatcher user ;) You can restart the workflow with the --start=START_STEP parameter, so in your case --start=211. As for the error, I would suggest monitoring your system usage while the step runs (using htop or top) to confirm whether memory or CPU is the bottleneck. If it is, consider temporarily increasing swap space or further limiting the number of simultaneous processes. If that doesn't help, you could use a tool like split, or a custom Python script, to divide the large input file into smaller chunks, process them sequentially, and merge the outputs afterward.
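In case it helps, here is a minimal sketch of the chunking idea. It assumes the .map file is tab-separated with the read ID in the first column and that lines belonging to the same read are adjacent (the file name suggests it is sorted). I have not verified either assumption against FusionCatcher, nor whether find_homolog_genes.py gives equivalent results when run on chunks. The function name, the chunk naming scheme, and the chunk size are placeholders.

```python
import itertools
import os
import tempfile


def split_by_read(path, out_prefix, max_lines=1_000_000):
    """Split a tab-separated file into chunks of roughly max_lines lines,
    never separating consecutive lines that share the same first column."""
    chunk_paths = []
    out = None
    lines_in_chunk = 0
    with open(path) as infile:
        # Group consecutive lines by their first column (assumed read ID)
        groups = itertools.groupby(infile, key=lambda l: l.split("\t", 1)[0])
        for _, group in groups:
            # Start a new chunk only on a group boundary
            if out is None or lines_in_chunk >= max_lines:
                if out is not None:
                    out.close()
                chunk_path = f"{out_prefix}.{len(chunk_paths):04d}.map"
                chunk_paths.append(chunk_path)
                out = open(chunk_path, "w")
                lines_in_chunk = 0
            for line in group:
                out.write(line)
                lines_in_chunk += 1
    if out is not None:
        out.close()
    return chunk_paths


# Tiny demonstration with invented reads r1, r2, r3
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "demo.map")
    with open(src, "w") as fh:
        fh.write("r1\tA\nr1\tB\nr2\tA\nr2\tB\nr2\tC\nr3\tA\n")
    chunks = split_by_read(src, os.path.join(tmp, "chunk"), max_lines=2)
    counts = [sum(1 for _ in open(c)) for c in chunks]
    print(counts)
```

Chunks can exceed max_lines slightly because a group is never cut in half. How the per-chunk outputs should be merged depends on what find_homolog_genes.py writes, so treat this as a starting point only.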
Best, Nadine
Dear Nadine,
Thank you for your helpful suggestions regarding the FusionCatcher workflow. I appreciate your advice on monitoring system usage and considering solutions like increasing swap space or splitting the input file.
Regarding the swap space, I previously ran the process successfully using 16 cores, but I suspect I might have deleted the swap space settings in WSL, which is causing interruptions at step 211. Could you advise on an optimal swap space size? Additionally, any tips on reconfiguring swap space in WSL would be greatly appreciated.
Thank you once again for your assistance!
Best regards, Luke
Dear Luke,
Thank you for your message. I'm glad to hear that my suggestions were helpful! While I don’t have direct experience with WSL, I can guide you on reconfiguring swap space within WSL based on general principles and available documentation.
As a general rule of thumb (https://help.ubuntu.com/community/SwapFaq), the swap size should be equal to the size of your RAM for systems with ≤ 8 GB of RAM, and half the size of your RAM for systems with > 8 GB of RAM (e.g., 8 GB of swap for 16 GB of RAM).
For your current workload, given that the input file is 3.3 GB and you're running 16 processes, I recommend setting the swap space to somewhere between 8 and 16 GB.
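Written out as arithmetic, the rule of thumb looks like this. The function name is hypothetical, and it encodes only the simplified rule as I stated it, not everything covered in the SwapFaq.

```python
def rule_of_thumb_swap_gb(ram_gb):
    """Swap size in GB under the simplified rule:
    equal to RAM up to 8 GB, half of RAM above that."""
    if ram_gb <= 8:
        return ram_gb
    return ram_gb / 2


for ram in (4, 8, 16, 32):
    print(f"{ram} GB RAM -> {rule_of_thumb_swap_gb(ram):g} GB swap")
```

This is only a starting point; a memory-hungry step may need more than the rule suggests.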
Here’s how you can recreate and configure swap space in WSL (based on https://learn.microsoft.com/en-us/windows/wsl/wsl-config):
The configuration file for WSL is located at: C:\Users\
Add the following lines to the .wslconfig file, adjusting the swap size as needed:
[wsl2]
memory=8GB
processors=16
swap=16GB
swapFile=C:\Users\
After saving the changes, restart WSL for the new configuration to take effect. Run the following commands in PowerShell:
wsl --shutdown
wsl
Inside WSL, verify the swap space configuration by running:
free -h
This should display the total memory and swap space available to WSL.
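If you would rather check this from a script, the same totals can be read from /proc/meminfo, which on Linux reports the memory fields in kB. A minimal sketch follows; the function name is illustrative, and the sample text uses invented numbers rather than values from a real system.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: integer value}
    (kB for the memory and swap fields)."""
    values = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


# Invented sample in the /proc/meminfo layout
sample = """MemTotal:        8148324 kB
MemAvailable:    6021448 kB
SwapTotal:      16777216 kB
SwapFree:       16777216 kB
"""

info = parse_meminfo(sample)
print(f"RAM:  {info['MemTotal'] / 1024**2:.1f} GiB")
print(f"Swap: {info['SwapTotal'] / 1024**2:.1f} GiB")
```

On a real system, pass the contents of /proc/meminfo instead of the sample text, for example parse_meminfo(open("/proc/meminfo").read()).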
I hope these steps point you in the right direction for resolving the issue and configuring swap space in WSL. Unfortunately, as I’m not deeply familiar with it, I may not be able to assist further, but I encourage you to explore official documentation or reach out to experts in this area for more detailed guidance.
Best regards, Nadine
Dear Nadine,
Thank you for your previous assistance. I followed your advice to adjust the swap space settings, but I have now been stuck at step 211 for 12 hours. The system monitoring shows high memory and CPU usage, but the process isn't progressing.
Could you offer further advice? Are there additional methods to optimize performance or resolve this issue?
Thank you very much for your support!
Best regards, Luke
Dear ndaniel,
I am using FusionCatcher for gene fusion prediction and would like to output the predicted results in BEDPE format. Is there any existing command-line option or configuration to directly convert FusionCatcher's output into BEDPE format?
If FusionCatcher does not provide this feature, could you suggest any methods to convert the final-list_candidate-fusion-genes.txt file into BEDPE format?
Thank you for any advice or solutions you can provide.
Best regards, Luke