Custom Genome and Visualizations

singhbhavya commented 6 months ago

Hi there,

I apologize for the naive question I'm about to ask, but I've been struggling with this for a week and would appreciate some help. I created a custom gene index using a FASTA file of fusion genes (each ">" header line is one gene) and then aligned reads to these fusion genes. I'd like to visualize the FASTA as a "genome" in IGV, along with the read alignments to the genes. When I try to load the FASTA in IGV as a genome, I get an error saying that IGV cannot set the starting chromosome. When I try to import the BAM alignments, I get the error "Invalid BAM file header: missing sequence name in file". Could you please help me understand what I'm missing? Do I also need to provide an annotation to go with this custom "genome", describing what each gene is?

jrobinso commented 6 months ago

You should be able to load the fasta from the "Genome" menu; I don't understand the error you are getting. What version of IGV are you using?

By "import" bam alignments I assume you are loading the BAM file from the "File" menu, correct? The error message indicates there is something wrong with your BAM file.

If you are able to share these files (fasta, fasta index, bam, and bam index) email us at igv-team@broadinstitute.org and I can send you a secure dropbox link. But first confirm that you are using a recent version of IGV.
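One way to narrow this down before sharing files is to compare the sequence names in the FASTA index with the names in the BAM header. The sketch below is only an illustration, not something IGV provides: it assumes you have the text of the `.fai` file and the text header of the BAM (for example from `samtools view -H`), and the function names `fai_names`, `sam_header_names`, and `compare_names` are made up for this example.

```python
def fai_names(fai_text):
    """Return sequence names from the text of a FASTA index (.fai).

    The first tab-separated column of each .fai line is the sequence name.
    """
    return [line.split("\t")[0] for line in fai_text.splitlines() if line.strip()]


def sam_header_names(header_text):
    """Return the SN value of every @SQ line in a SAM/BAM text header.

    An @SQ line without an SN field is recorded as None.
    """
    names = []
    for line in header_text.splitlines():
        if not line.startswith("@SQ"):
            continue
        fields = line.split("\t")[1:]
        sn = next((f[3:] for f in fields if f.startswith("SN:")), None)
        names.append(sn)
    return names


def compare_names(fai_text, header_text):
    """Summarize mismatches between FASTA index names and BAM header names."""
    fasta = set(fai_names(fai_text))
    bam = sam_header_names(header_text)
    bam_named = {n for n in bam if n}
    return {
        "missing_sn": sum(1 for n in bam if not n),
        "only_in_bam": sorted(bam_named - fasta),
        "only_in_fasta": sorted(fasta - bam_named),
    }


if __name__ == "__main__":
    # Made-up example data, not taken from the files discussed in this thread.
    fai = "geneA\t1200\t7\t60\t61\ngeneB\t900\t1234\t60\t61\n"
    header = "@HD\tVN:1.6\n@SQ\tSN:geneA\tLN:1200\n@SQ\tSN:geneC\tLN:900\n"
    print(compare_names(fai, header))
```

A non-zero `missing_sn` or a non-empty `only_in_bam` list would suggest the BAM was built against names that the FASTA does not define.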

singhbhavya commented 6 months ago

Hi, thank you so much for the response! The version I am using is 2.3.98.

Yes, correct, I am loading the BAM file from the "File" menu. Please let me know whether or not I can email you - thank you again!

jrobinso commented 6 months ago

Sorry, I can't provide any help for that version; it was released in 2017. You might try the latest version, 2.17.4. If you would like me to look at your files, please send an email to the address noted above for a dropbox link, or share them in some other way.

singhbhavya commented 5 months ago

Hi there, I updated the version to 2.17.4 and received the same error. Sending you an email! Thank you so much!

singhbhavya commented 5 months ago

Hi there! I identified the problems and fixed them. In case anyone else goes through the same thing, here they are:

There were unexpected characters in the FASTA headers. I replaced those characters in the genome, and re-aligned the FASTQs to the genome.
Due to the characters, the genome wasn't being correctly loaded into IGV, and this fixed it as well.

I used a combination of these two scripts:

Python script to replace every ">" after the first one in each header line with "-":

def replace_gt_with_dash_except_first(filename):
    # Read the whole file first, because it is rewritten in place below
    with open(filename, 'r') as file:
        lines = file.readlines()

    with open(filename, 'w') as file:
        for line in lines:
            if line.startswith('>'):
                # Keep the leading '>' and replace every later '>' with '-'
                parts = line.split('>')
                line = parts[0] + '>' + '-'.join(parts[1:])
            file.write(line)

filename = 'Genomic_sequences_fromFusion_batch1.fasta'
replace_gt_with_dash_except_first(filename)

Bash script to replace parentheses and dashes with colons:

#!/bin/bash

# Function to replace dashes and parentheses with colons in sequence names
replace_specific_chars() {
    local input_file=$1
    local output_file=$2

    # Only header lines (those starting with '>') are modified
    sed -E '/^>/ s/[-()]/:/g' "$input_file" > "$output_file"
}

input_file="Genomic_sequences_fromFusion_batch1.fasta"
output_file="output.fasta"

replace_specific_chars "$input_file" "$output_file"
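For anyone who prefers a single pass, the net effect of the two scripts above (every ">" after the first, plus every "-", "(" and ")" in a header line, ends up as ":") can be sketched in Python roughly as follows. This is an illustrative rewrite, not the code that was actually run; `sanitize_header` and `sanitize_fasta` are names chosen for this example, and it writes to a new file instead of editing in place.

```python
import re


def sanitize_header(line):
    """Replace '>', '-', '(' and ')' after the leading '>' with ':'.

    Sequence lines (anything not starting with '>') are returned unchanged.
    """
    if not line.startswith(">"):
        return line
    return ">" + re.sub(r"[>()\-]", ":", line[1:])


def sanitize_fasta(src_path, dst_path):
    """Copy a FASTA file to dst_path, rewriting only the header lines."""
    with open(src_path, "r") as src, open(dst_path, "w") as dst:
        for line in src:
            dst.write(sanitize_header(line))


if __name__ == "__main__":
    # Made-up header for illustration only.
    print(sanitize_header(">GENEA>GENEB(chr1)-x\n"), end="")
```

As noted above, the reads need to be re-aligned to the renamed FASTA afterwards so that the names in the BAM header match the new sequence names.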

jrobinso commented 5 months ago

@singhbhavya Thanks for this, I'm sure it will be helpful. If you could post one of the offending fasta header lines here, I will see if we can improve the parser to load it without modification. The main rule is that the sequence name should be the string between the initial ">" and the first whitespace; we should be able to change the parser to ignore everything else.
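The naming rule described above can be illustrated with a few lines of Python. This is a minimal sketch of the rule as stated in the comment, not IGV's actual parser; `split_header` is a hypothetical helper and the example headers are invented.

```python
import re


def split_header(line):
    """Split a FASTA header into (name, description).

    The name is the text between the initial '>' and the first whitespace;
    everything after that whitespace is treated as a free-form description.
    """
    match = re.match(r">(\S*)\s*(.*)", line.rstrip("\n"))
    if match is None:
        raise ValueError("not a FASTA header line")
    name, description = match.groups()
    return name, description.strip()


if __name__ == "__main__":
    # Invented headers, not the ones from the files discussed in this thread.
    for header in (">fusion1 GENEA>GENEB (chr1-chr2)", ">GENEA>GENEB(x)"):
        print(split_header(header))
```

Under this rule, special characters in the description are harmless, while a header with no whitespace (the second example) carries them all in the sequence name itself.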

igvteam / igv

Custom Genome and Visualizations #1519