jl02142 / OrthoRefine

Segmentation fault #3

Open juansebe1 opened 3 weeks ago

juansebe1 commented 3 weeks ago

Hello! I'm trying to test this tool with a small dataset, but I got a 'Segmentation fault' message and I don't know what's going on. Could you help me solve it?

./orthorefine.exe --input input.txt --OF_file N0.tsv --window_size 8 --synteny_ratio 0.5
Segmentation fault

input.txt

Head of N0.tsv:

HOG OG Gene Tree Parent Clade F.hispaniensis F.opportunistica-142155 F.salina
N0.HOG0000000 OG0000000 n0 lcl|NZ_CP018093.1_prot_FSC454_RS07665_1515, lcl|NZ_CP018093.1_prot_FSC454_RS03685_724 lc>
N0.HOG0000001 OG0000001 n0 lcl|NZ_CP022375.1_prot_WP_198150407.1_912, lcl|NZ_CP022375.1_prot_WP_198150407.1_1283, lcl>
N0.HOG0000002 OG0000002 n1 lcl|NZ_CP018093.1_prot_WP_197456248.1_1581, lcl|NZ_CP018093.1_prot_WP_197456248.1_582 lc>
N0.HOG0000003 OG0000002 n3 lcl|NZ_CP018093.1_prot_WP_156470857.1_1013, lcl|NZ_CP018093.1_prot_WP_244148270.1_1829, lc>
N0.HOG0000004 OG0000003 n0 lcl|NZ_CP018093.1_prot_FSC454_RS02585_506, lcl|NZ_CP018093.1_prot_FSC454_RS00585_115, lcl|>
N0.HOG0000005 OG0000004 n1 lcl|NZ_CP018093.1_prot_WP_066046703.1_444 lcl|NZ_CP022375.1_prot_WP_071629540.1_1295>
N0.HOG0000006 OG0000004 n4 lcl|NZ_CP018093.1_prot_WP_014715333.1_1297 lcl|NZ_CP022375.1_prot_WP_071629258.1_988 >
N0.HOG0000007 OG0000004 n6 lcl|NZ_CP018093.1_prot_WP_003033846.1_1161 lcl|NZ_CP022375.1_prot_WP_071629064.1_771 >
N0.HOG0000008 OG0000005 n0 lcl|NZ_CP018093.1_prot_FSC454_RS10000_1199, lcl|NZ_CP018093.1_prot_FSC454_RS06065_1201, lc>
N0.HOG0000009 OG0000006 n0 lcl|NZ_CP018093.1_prot_WP_071794787.1_854, lcl|NZ_CP018093.1_prot_WP_071794788.1_855, lcl|>
N0.HOG0000010 OG0000007 n1 lcl|NZ_CP018093.1_prot_WP_066046330.1_168 lcl|NZ_CP022375.1_prot_WP_071628521.1_157 >
N0.HOG0000011 OG0000007 n3 lcl|NZ_CP018093.1_prot_WP_066045340.1_1632 lcl|NZ_CP022375.1_prot_WP_071629755.1_1531>
N0.HOG0000012 OG0000008 n0 lcl|NZ_CP018093.1_prot_FSC454_RS01480_292, lcl|NZ_CP018093.1_prot_WP_231865178.1_1110, lcl>
N0.HOG0000013 OG0000009 n1 lcl|NZ_CP018093.1_prot_FSC454_RS05915_1169 lcl|NZ_CP022375.1_prot_WP_071629056.1_763 >
N0.HOG0000014 OG0000009 n3 lcl|NZ_CP018093.1_prot_WP_066045627.1_372 lcl|NZ_CP022375.1_prot_WP_071629606.1_1370>
N0.HOG0000015 OG0000010 n0 lcl|NZ_CP018093.1_prot_WP_066046902.1_1055, lcl|NZ_CP018093.1_prot_WP_080555366.1_704 lc>
N0.HOG0000016 OG0000011 n0 lcl|NZ_CP018093.1_prot_FSC454_RS03655_718 lcl|NZ_CP022375.1_prot_WP_071628872.1_552 >
N0.HOG0000017 OG0000012 n0 lcl|NZ_CP018093.1_prot_WP_066044704.1_770 lcl|NZ_CP022375.1_prot_WP_071628925.1_609 >
N0.HOG0000018 OG0000013 n1 lcl|NZ_CP018093.1_prot_WP_156860530.1_1116, lcl|NZ_CP018093.1_prot_WP_066045399.1_1117 lc>
N0.HOG0000019 OG0000013 n3 lcl|NZ_CP018093.1_prot_WP_156860531.1_1542, lcl|NZ_CP018093.1_prot_WP_156860529.1_1115, lc>
N0.HOG0000020 OG0000014 n1 lcl|NZ_CP018093.1_prot_WP_066045288.1_1653 lcl|NZ_CP022375.1_prot_WP_071629770.1_1546>
N0.HOG0000021 OG0000014 n3 lcl|NZ_CP018093.1_prot_WP_244148253.1_1179 lcl|NZ_CP022375.1_prot_WP_071629238.1_963,>
N0.HOG0000022 OG0000015 n1 lcl|NZ_CP018093.1_prot_WP_066046621.1_1788 lcl|NZ_CP022375.1_prot_WP_071629866.1_1651>
N0.HOG0000023 OG0000015 n3 lcl|NZ_CP018093.1_prot_WP_066046618.1_1787 lcl|NZ_CP022375.1_prot_WP_071629865.1_1650>
N0.HOG0000024 OG0000016 n1 lcl|NZ_CP018093.1_prot_WP_066046651.1_1806 lcl|NZ_CP022375.1_prot_WP_071629921.1_1711>
N0.HOG0000025 OG0000016 n4 lcl|NZ_CP018093.1_prot_WP_014549092.1_1805 lcl|NZ_CP022375.1_prot_WP_071629920.1_1710>
N0.HOG0000026 OG0000017 n1 lcl|NZ_CP018093.1_prot_WP_066045151.1_48 lcl|NZ_CP022375.1_prot_WP_071628416.1_40 >

jl02142 commented 3 weeks ago

The gene/protein identifiers ("lcl|NZ_CP018093.1_prot_FSC454_RS07665_1515") are unfamiliar to me. Which database or program are they from? I've written a bash script that you can copy and paste into a text file in the same directory you are running OrthoRefine from. Running it will generate a file called "OrthoRefine_debug.log.txt", which you can copy or post back. As a privacy warning, it prints the names of all files in the current directory, so you may want to review the log before posting it.

#!/usr/bin/env bash

# print current files names only in dir
ls > OrthoRefine_debug.log.txt
echo >> OrthoRefine_debug.log.txt

#print contents of "input.txt" to OrthoRefine_debug.log.txt
cat input.txt >> OrthoRefine_debug.log.txt
echo >> OrthoRefine_debug.log.txt

# check if file "N0.tsv" exists in the current directory
if [ -f "N0.tsv" ]; then
    echo "N0.tsv exists" >> OrthoRefine_debug.log.txt
    # check if file "N0.tsv" is in dos or unix format 
    if file "N0.tsv" | grep -q "CRLF"; then
        echo "N0.tsv is in dos format" >> OrthoRefine_debug.log.txt
    else
        echo "N0.tsv is in unix format" >> OrthoRefine_debug.log.txt
    fi
else
    echo "N0.tsv does not exist" >> OrthoRefine_debug.log.txt
fi

echo >> OrthoRefine_debug.log.txt
# read the first column of each line from "input.txt" and write to array
IFS=$'\n' read -d '' -r -a lines < input.txt
# verify that each element of array exists as a feature table file. E.g. "GCF_000005845.2" as "GCF_000005845.2_ASM584v2_feature_table.txt"
for i in "${lines[@]}" ; do
    if [ -f "$i"*_feature_table.txt ]; then
        echo "$i"*_feature_table.txt exists >> OrthoRefine_debug.log.txt
    else
        echo "$i"_feature_table.txt does not exist >> OrthoRefine_debug.log.txt
    fi
done

echo >> OrthoRefine_debug.log.txt
# verify that each element of array exists as a fasta file. E.g. "GCF_000005845.2" as "GCF_000005845.2_ASM584v2_protein.faa"
for i in "${lines[@]}" ; do
    if [ -f "$i"*_protein.faa ]; then
        echo "$i"*_protein.faa exists >> OrthoRefine_debug.log.txt
    else
        echo "$i"_protein.faa does not exist >> OrthoRefine_debug.log.txt
    fi
done

echo >> OrthoRefine_debug.log.txt
# Check that both the feature table and fasta file exist for the same element of array lines
for i in "${!lines[@]}" ; do
    if [ -f "${lines[$i]}"*_feature_table.txt ] && [ -f "${lines[$i]}"*_protein.faa ]; then
        echo "Both feature table and fasta file exist for ${lines[$i]}" >> OrthoRefine_debug.log.txt
        # Store the index of the first occurrence
        first_occurrence=$i
        break
    fi
done
# Store the 11th column from the output above in a new array
IFS=$'\n' read -d '' -r -a new_array <<< "$(grep -m 10 "^CDS" "${lines[first_occurrence]}"*_feature_table.txt | awk '{print $11}')"
# Verify each element of new_array can be found in the associated fasta file
for i in "${new_array[@]}" ; do
    if grep -q "$i" "${lines[first_occurrence]}"*_protein.faa; then
        echo "$i" exists in "${lines[first_occurrence]}"*_protein.faa >> OrthoRefine_debug.log.txt
    else
        echo "$i" does not exist in "${lines[first_occurrence]}"*_protein.faa >> OrthoRefine_debug.log.txt
    fi
done

echo >> OrthoRefine_debug.log.txt
# Verify that each element of new_array can be found in "N0.tsv"
for i in "${new_array[@]}" ; do
    if grep -q "$i" N0.tsv; then
        echo "$i" exists in N0.tsv >> OrthoRefine_debug.log.txt
        # store the line from N0.tsv where the element was found into another array without overwriting the previous element without splitting the line
        tmp=$(grep "$i" N0.tsv)
        N0_lines+="$tmp\n"
    else
        echo "$i" does not exist in N0.tsv >> OrthoRefine_debug.log.txt
    fi
done

echo >> OrthoRefine_debug.log.txt
# print array N0_lines
for i in "${N0_lines[@]}" ; do
    echo -e "$i" >> OrthoRefine_debug.log.txt # -e allows for newline char printing
done

# find the longest line by tabs and comma count in N0_lines
longest_line=$(echo -e "${N0_lines[@]}" | awk -F"\t|," '{print NF}' | sort -n | tail -1)

# for each element of N0_lines, extract the 4th column to end of line and store in another array
IFS=$'\n' read -d '' -r -a new_array2 <<< "$(echo -e "${N0_lines[@]}" | awk -F"\t|," -v e=$longest_line '{ for (i=4;i<e;++i) print $i }' | tr -d ' ')"

# for each element of new_array2, continue to search the feature table files 11th column until the element is found
for i in "${new_array2[@]}" ; do
    found_flag=0
    for j in "${lines[@]}" ; do
        if grep -q "$i" "$j"*_feature_table.txt; then
            echo "$i" exists in "$j"*_feature_table.txt >> OrthoRefine_debug.log.txt
            found_flag=1
            break
        fi
    done
    if [ $found_flag -eq 0 ]; then
        echo "$i" does not exist in any feature_table.txt >> OrthoRefine_debug.log.txt
    fi
done

juansebe1 commented 3 weeks ago

Hi! Those were extracted from GenBank, and then I ran OrthoFinder to obtain the 'N0.tsv' file.

Here is the result from the bash script: OrthoRefine_debug.log.txt

jl02142 commented 3 weeks ago

Are the feature table files (or genome annotation files) located in a different directory? I've attached the debug output from my test E. coli run so you can see roughly what it should look like.

GCF_000005845.2_ASM584v2_feature_table.txt
GCF_000005845.2_ASM584v2_protein.faa
GCF_013892435.1_ASM1389243v1_feature_table.txt
GCF_013892435.1_ASM1389243v1_protein.faa
GCF_016904755.1_ASM1690475v2_feature_table.txt
GCF_016904755.1_ASM1690475v2_protein.faa
GCF_902709585.1_H1-003-0086-C-F.v2_feature_table.txt
GCF_902709585.1_H1-003-0086-C-F.v2_protein.faa
N0.tsv
OrthoFinder
OrthoRefine_debug.log.txt
OrthoRefine_debug.sh
download_ft_fa.txt
input.txt

GCF_000005845.2
GCF_013892435.1
GCF_016904755.1
GCF_902709585.1

N0.tsv exists
N0.tsv is in unix format

GCF_000005845.2_ASM584v2_feature_table.txt exists
GCF_013892435.1_ASM1389243v1_feature_table.txt exists
GCF_016904755.1_ASM1690475v2_feature_table.txt exists
GCF_902709585.1_H1-003-0086-C-F.v2_feature_table.txt exists

GCF_000005845.2_ASM584v2_protein.faa exists
GCF_013892435.1_ASM1389243v1_protein.faa exists
GCF_016904755.1_ASM1690475v2_protein.faa exists
GCF_902709585.1_H1-003-0086-C-F.v2_protein.faa exists

Both feature table and fasta file exist for GCF_000005845.2
NP_414542.1 exists in GCF_000005845.2_ASM584v2_protein.faa
NP_414543.1 exists in GCF_000005845.2_ASM584v2_protein.faa
NP_414544.1 exists in GCF_000005845.2_ASM584v2_protein.faa
NP_414545.1 exists in GCF_000005845.2_ASM584v2_protein.faa
NP_414546.1 exists in GCF_000005845.2_ASM584v2_protein.faa
NP_414547.1 exists in GCF_000005845.2_ASM584v2_protein.faa
NP_414548.1 exists in GCF_000005845.2_ASM584v2_protein.faa
NP_414549.1 exists in GCF_000005845.2_ASM584v2_protein.faa
NP_414550.1 exists in GCF_000005845.2_ASM584v2_protein.faa
NP_414551.1 exists in GCF_000005845.2_ASM584v2_protein.faa

NP_414542.1 exists in N0.tsv
NP_414543.1 exists in N0.tsv
NP_414544.1 exists in N0.tsv
NP_414545.1 exists in N0.tsv
NP_414546.1 exists in N0.tsv
NP_414547.1 exists in N0.tsv
NP_414548.1 exists in N0.tsv
NP_414549.1 exists in N0.tsv
NP_414550.1 exists in N0.tsv
NP_414551.1 exists in N0.tsv

N0.HOG0000583   OG0000421   n0  NP_414542.1 WP_001386572.1  WP_001386572.1  WP_001386572.1
N0.HOG0000584   OG0000422   n0  NP_414543.1 WP_001264663.1  WP_059235060.1  WP_010378218.1
N0.HOG0000585   OG0000423   n0  NP_414544.1 WP_000252740.1  WP_000241676.1  WP_001517712.1
N0.HOG0000586   OG0000424   n0  NP_414545.1 WP_000781090.1  WP_208631050.1  WP_000781035.1
N0.HOG0000587   OG0000425   n0  NP_414546.1 WP_000771325.1  WP_000738743.1  WP_105224911.1
N0.HOG0000588   OG0000426   n0  NP_414547.1 WP_000906158.1  WP_000906164.1  WP_000906159.1
N0.HOG0000589   OG0000427   n0  NP_414548.1 WP_001112548.1  WP_059235058.1  WP_016249591.1
N0.HOG0000064   OG0000022   n4  NP_414549.1 WP_046076191.1, WP_000130195.1  WP_000130184.1  WP_000130186.1
N0.HOG0000590   OG0000428   n0  NP_414550.1 WP_046083027.1  WP_001094685.1  WP_001517716.1
N0.HOG0000591   OG0000429   n0  NP_414551.1 WP_000528529.1  WP_000528545.1  WP_000528538.1

NP_414542.1 exists in GCF_000005845.2_ASM584v2_feature_table.txt
WP_001386572.1 exists in GCF_000005845.2_ASM584v2_feature_table.txt
WP_001386572.1 exists in GCF_000005845.2_ASM584v2_feature_table.txt
WP_001386572.1 exists in GCF_000005845.2_ASM584v2_feature_table.txt
NP_414543.1 exists in GCF_000005845.2_ASM584v2_feature_table.txt
WP_001264663.1 exists in GCF_013892435.1_ASM1389243v1_feature_table.txt
WP_059235060.1 exists in GCF_016904755.1_ASM1690475v2_feature_table.txt
WP_010378218.1 exists in GCF_902709585.1_H1-003-0086-C-F.v2_feature_table.txt
NP_414544.1 exists in GCF_000005845.2_ASM584v2_feature_table.txt
WP_000252740.1 exists in GCF_013892435.1_ASM1389243v1_feature_table.txt
WP_000241676.1 exists in GCF_016904755.1_ASM1690475v2_feature_table.txt
WP_001517712.1 exists in GCF_902709585.1_H1-003-0086-C-F.v2_feature_table.txt
NP_414545.1 exists in GCF_000005845.2_ASM584v2_feature_table.txt
WP_000781090.1 exists in GCF_013892435.1_ASM1389243v1_feature_table.txt
WP_208631050.1 exists in GCF_016904755.1_ASM1690475v2_feature_table.txt
WP_000781035.1 exists in GCF_902709585.1_H1-003-0086-C-F.v2_feature_table.txt
NP_414546.1 exists in GCF_000005845.2_ASM584v2_feature_table.txt
WP_000771325.1 exists in GCF_013892435.1_ASM1389243v1_feature_table.txt
WP_000738743.1 exists in GCF_016904755.1_ASM1690475v2_feature_table.txt
WP_105224911.1 exists in GCF_902709585.1_H1-003-0086-C-F.v2_feature_table.txt
NP_414547.1 exists in GCF_000005845.2_ASM584v2_feature_table.txt
WP_000906158.1 exists in GCF_013892435.1_ASM1389243v1_feature_table.txt
WP_000906164.1 exists in GCF_016904755.1_ASM1690475v2_feature_table.txt
WP_000906159.1 exists in GCF_902709585.1_H1-003-0086-C-F.v2_feature_table.txt
NP_414548.1 exists in GCF_000005845.2_ASM584v2_feature_table.txt
WP_001112548.1 exists in GCF_013892435.1_ASM1389243v1_feature_table.txt
WP_059235058.1 exists in GCF_016904755.1_ASM1690475v2_feature_table.txt
WP_016249591.1 exists in GCF_902709585.1_H1-003-0086-C-F.v2_feature_table.txt
NP_414549.1 exists in GCF_000005845.2_ASM584v2_feature_table.txt
WP_046076191.1 exists in GCF_013892435.1_ASM1389243v1_feature_table.txt
WP_000130195.1 exists in GCF_013892435.1_ASM1389243v1_feature_table.txt
WP_000130184.1 exists in GCF_016904755.1_ASM1690475v2_feature_table.txt
NP_414550.1 exists in GCF_000005845.2_ASM584v2_feature_table.txt
WP_046083027.1 exists in GCF_013892435.1_ASM1389243v1_feature_table.txt
WP_001094685.1 exists in GCF_016904755.1_ASM1690475v2_feature_table.txt
WP_001517716.1 exists in GCF_902709585.1_H1-003-0086-C-F.v2_feature_table.txt
NP_414551.1 exists in GCF_000005845.2_ASM584v2_feature_table.txt
WP_000528529.1 exists in GCF_013892435.1_ASM1389243v1_feature_table.txt
WP_000528545.1 exists in GCF_016904755.1_ASM1690475v2_feature_table.txt
WP_000528538.1 exists in GCF_000005845.2_ASM584v2_feature_table.txt

juansebe1 commented 3 weeks ago

Yes, I had the feature table and protein.faa files in another directory. Now I have them in the same directory, but the debug.log.txt output says *_feature_table.txt does not exist

OrthoRefine_debug.log.txt

./orthorefine.exe still doesn't run. What else is missing, or what am I doing wrong?

Thanks

jl02142 commented 3 weeks ago

I made a mistake in the debug script: it did not account for the second and third columns you are using in the input file. I also changed it to print the first line of the N0.tsv file so I can see it. Can you copy the script below and run it again? If we keep encountering errors, we can drop the second and third columns from the input file to see if that is the problem.

#!/usr/bin/env bash

# print current files names only in dir
ls > OrthoRefine_debug.log.txt
echo >> OrthoRefine_debug.log.txt

#print contents of "input.txt" to OrthoRefine_debug.log.txt
cat input.txt >> OrthoRefine_debug.log.txt
echo >> OrthoRefine_debug.log.txt

# check if file "N0.tsv" exists in the current directory
if [ -f "N0.tsv" ]; then
    echo "N0.tsv exists" >> OrthoRefine_debug.log.txt
    head -1 N0.tsv >> OrthoRefine_debug.log.txt
    # check if file "N0.tsv" is in dos or unix format 
    if file "N0.tsv" | grep -q "CRLF"; then
        echo "N0.tsv is in dos format" >> OrthoRefine_debug.log.txt
    else
        echo "N0.tsv is in unix format" >> OrthoRefine_debug.log.txt
    fi
else
    echo "N0.tsv does not exist" >> OrthoRefine_debug.log.txt
fi

echo >> OrthoRefine_debug.log.txt
# read the first column of each line from "input.txt" and write to array
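# (the command substitution is quoted so that older bash versions do not word-split it onto one line)
IFS=$'\n' read -d '' -r -a lines <<< "$(cut -f1 input.txt)"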

# verify that each element of array exists as a feature table file. E.g. "GCF_000005845.2" as "GCF_000005845.2_ASM584v2_feature_table.txt"
for i in "${lines[@]}" ; do
    if [ -f "$i"*_feature_table.txt ]; then
        echo "$i"*_feature_table.txt exists >> OrthoRefine_debug.log.txt
    else
        echo "$i"_feature_table.txt does not exist >> OrthoRefine_debug.log.txt
    fi
done

echo >> OrthoRefine_debug.log.txt
# verify that each element of array exists as a fasta file. E.g. "GCF_000005845.2" as "GCF_000005845.2_ASM584v2_protein.faa"
for i in "${lines[@]}" ; do
    if [ -f "$i"*_protein.faa ]; then
        echo "$i"*_protein.faa exists >> OrthoRefine_debug.log.txt
    else
        echo "$i"_protein.faa does not exist >> OrthoRefine_debug.log.txt
    fi
done

echo >> OrthoRefine_debug.log.txt
# Check that both the feature table and fasta file exist for the same element of array lines
for i in "${!lines[@]}" ; do
    if [ -f "${lines[$i]}"*_feature_table.txt ] && [ -f "${lines[$i]}"*_protein.faa ]; then
        echo "Both feature table and fasta file exist for ${lines[$i]}" >> OrthoRefine_debug.log.txt
        # Store the index of the first occurrence
        first_occurrence=$i
        break
    fi
done
# Store the 11th column from the output above in a new array
IFS=$'\n' read -d '' -r -a new_array <<< "$(grep -m 10 "^CDS" "${lines[first_occurrence]}"*_feature_table.txt | awk '{print $11}')"
# Verify each element of new_array can be found in the associated fasta file
for i in "${new_array[@]}" ; do
    if grep -q "$i" "${lines[first_occurrence]}"*_protein.faa; then
        echo "$i" exists in "${lines[first_occurrence]}"*_protein.faa >> OrthoRefine_debug.log.txt
    else
        echo "$i" does not exist in "${lines[first_occurrence]}"*_protein.faa >> OrthoRefine_debug.log.txt
    fi
done

echo >> OrthoRefine_debug.log.txt
# Verify that each element of new_array can be found in "N0.tsv"
for i in "${new_array[@]}" ; do
    if grep -q "$i" N0.tsv; then
        echo "$i" exists in N0.tsv >> OrthoRefine_debug.log.txt
        # store the line from N0.tsv where the element was found into another array without overwriting the previous element without splitting the line
        tmp=$(grep "$i" N0.tsv)
        N0_lines+="$tmp\n"
    else
        echo "$i" does not exist in N0.tsv >> OrthoRefine_debug.log.txt
    fi
done

echo >> OrthoRefine_debug.log.txt
# print array N0_lines
for i in "${N0_lines[@]}" ; do
    echo -e "$i" >> OrthoRefine_debug.log.txt # -e allows for newline char printing
done

# find the longest line by tabs and comma count in N0_lines
longest_line=$(echo -e "${N0_lines[@]}" | awk -F"\t|," '{print NF}' | sort -n | tail -1)

# for each element of N0_lines, extract the 4th column to end of line and store in another array
IFS=$'\n' read -d '' -r -a new_array2 <<< "$(echo -e "${N0_lines[@]}" | awk -F"\t|," -v e=$longest_line '{ for (i=4;i<e;++i) print $i }' | tr -d ' ')"

# for each element of new_array2, continue to search the feature table files 11th column until the element is found
for i in "${new_array2[@]}" ; do
    found_flag=0
    for j in "${lines[@]}" ; do
        if grep -q "$i" "$j"*_feature_table.txt; then
            echo "$i" exists in "$j"*_feature_table.txt >> OrthoRefine_debug.log.txt
            found_flag=1
            break
        fi
    done
    if [ $found_flag -eq 0 ]; then
        echo "$i" does not exist in any feature_table.txt >> OrthoRefine_debug.log.txt
    fi
done
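
One caveat about the existence checks in the script above: a test such as `[ -f "$i"*_feature_table.txt ]` errors out when the glob matches more than one file, and the script then reports the file as missing. A minimal sketch of a more tolerant check is below. The function name `report_files` and its output wording are made up for illustration and are not part of OrthoRefine. It assumes the accession is the first tab-separated column of input.txt and that the data files are in the current directory.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of OrthoRefine): for each accession in the
# first tab-separated column of an input file, report how many files in the
# current directory match "<accession>*<suffix>".
# Usage: report_files input.txt _feature_table.txt
report_files() {
    local input=$1 suffix=$2 acc f matches
    while IFS=$'\t' read -r acc _ || [ -n "$acc" ]; do
        acc=${acc%$'\r'}              # tolerate DOS (CRLF) line endings
        [ -n "$acc" ] || continue     # skip empty lines
        matches=()
        for f in "$acc"*"$suffix"; do
            # An unmatched glob stays literal, so test that the file exists
            if [ -f "$f" ]; then matches+=("$f"); fi
        done
        case ${#matches[@]} in
            0) printf '%s: no file matching %s*%s\n' "$acc" "$acc" "$suffix" ;;
            1) printf '%s: %s exists\n' "$acc" "${matches[0]}" ;;
            *) printf '%s: %d files match %s*%s\n' "$acc" "${#matches[@]}" "$acc" "$suffix" ;;
        esac
    done < "$input"
}
```

Calling `report_files input.txt _feature_table.txt` and `report_files input.txt _protein.faa` from the data directory prints one line per accession.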

juansebe1 commented 3 weeks ago

I was checking the feature tables and protein.faa files that come in the 'pub_data' folder and they don't look like mine, so I extracted mine directly from the Genome assembly index, and now they look similar. I ran OrthoFinder again, added the new 'N0.tsv', and tried the ./orthorefine.exe command, but the 'Segmentation fault' persists.

Also, if I drop the 2nd and 3rd columns from 'input.txt', I get an 'Error feature table file missing' message.

This is the new debug log: OrthoRefine_debug.log.txt. In this file you can see, in lines 51 and 56, that a pair of protein sequences 'do not exist in N0.tsv'.

jl02142 commented 3 weeks ago

The proteins missing from N0.tsv aren't causing the crash. That message is there to warn me that they were present in the feature table but were not grouped by OrthoFinder into a HOG, so I shouldn't expect to see them.
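
To list every protein that is in a feature table but absent from N0.tsv, rather than only the first ten CDS rows the debug script samples, something like the sketch below could be used. It assumes the product accession is the 11th tab-separated column of CDS rows, as in the debug script. The function name `accessions_missing_from_n0` is hypothetical, and the match is a plain substring search.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of OrthoRefine): print CDS product accessions
# (column 11 of a tab-separated feature table) that do not occur anywhere in
# an OrthoFinder N0.tsv file.
# Usage: accessions_missing_from_n0 <feature_table.txt> <N0.tsv>
accessions_missing_from_n0() {
    local feature_table=$1 n0_file=$2 acc
    # Keep CDS rows that actually carry a product accession
    awk -F'\t' '$1 == "CDS" && $11 != "" { print $11 }' "$feature_table" |
        sort -u |
        while IFS= read -r acc; do
            # -F treats the accession as a fixed string, so "." is not a wildcard
            if ! grep -qF -- "$acc" "$n0_file"; then
                printf '%s\n' "$acc"
            fi
        done
}
```

An empty result means every CDS product accession in that feature table appears somewhere in N0.tsv.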

I downloaded the data files today and was able to run OrthoRefine (using the cpp file from the main and from the GFF branch) with both input files:

GCF_000219045.1
GCF_001885235.1
GCF_003347095.1

Note that in this second input file the columns must be separated by tabs, not spaces:

GCF_000219045.1 c       b
GCF_001885235.1 c       b
GCF_003347095.1 c       b

With the command:

./orthorefine.exe --input input.txt --OF_file N0.tsv --window_size 8 --synteny_ratio 0.5
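
Because the multi-column input file has to be tab-separated, a quick check for lines that were typed with spaces can rule that out before running OrthoRefine. The sketch below is only an illustration: the function name `check_input_tabs` is made up, and it simply flags any line that has more than one field and contains a space character.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of OrthoRefine): flag lines of an input file
# that have several fields but contain a space, which suggests the columns
# were separated with spaces instead of tabs.
# Returns 0 if no such line is found, 1 otherwise.
# Usage: check_input_tabs input.txt
check_input_tabs() {
    awk '
        BEGIN { bad = 0 }
        NF > 1 && index($0, " ") > 0 {
            printf "line %d contains spaces between fields: %s\n", NR, $0
            bad = 1
        }
        END { exit bad }
    ' "$1"
}
```

Running `check_input_tabs input.txt` prints nothing when every multi-column line uses tabs only.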

By genome assembly index, do you mean https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/? If not, OrthoRefine includes a script to download the feature table files and protein fasta from NCBI.

./download_ft_fafiles.sh input.txt

I would recommend checking the input.txt file and the command used to call OrthoRefine. If that doesn't work and you obtained the data from somewhere else besides the ftp website I linked above, make a new directory and use the download_ft_fafiles script to obtain the feature table and fasta files and rerun OrthoFinder with OrthoRefine. If you continue to experience a crash issue, I'll have to write debugger instructions for GDB so I can know which line of code is causing it. I've attached OrthoRefine's output file for these three test data below.
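
In the meantime, a generic way to see where a program crashes is to run it under GDB in batch mode and print a backtrace. The sketch below is not the OrthoRefine-specific debugger instructions mentioned above. It uses only standard gdb options (`-batch`, `-ex`, `--args`) and a made-up wrapper name, `run_under_gdb`. The backtrace shows source line numbers only if the binary was compiled with debug symbols (for example with `-g`), and how this executable was built is an assumption here.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper: run a command under gdb in batch mode and print a
# backtrace ("bt") once the program stops, e.g. on a segmentation fault.
# Set DRY_RUN=1 to print the gdb command line instead of executing it.
# Usage: run_under_gdb ./orthorefine.exe --input input.txt --OF_file N0.tsv ...
run_under_gdb() {
    local cmd=(gdb -batch -ex run -ex bt --args "$@")
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "${cmd[*]}"
    else
        "${cmd[@]}"
    fi
}
```

The output of `run_under_gdb ./orthorefine.exe --input input.txt --OF_file N0.tsv --window_size 8 --synteny_ratio 0.5` can then be copied into a reply.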

HOG SOG Gene_name   GCF_000219045.1_ASM21904v1_feature_table.txt    GCF_001885235.1_ASM188523v1_feature_table.txt   GCF_003347095.1_ASM334709v1_feature_table.txt
N0.HOG0000000   0.0 glycosyltransferase F7308_RS04340       CGC43_RS01820   
N0.HOG0000001   1.0 glycosyltransferase family 2 protein        FSC454_RS03920  CGC43_RS03135   
N0.HOG0000004   4.0 Bcr/CflA family efflux MFS transporter  F7308_RS09935       CGC43_RS08655   
N0.HOG0000014   14.0    hypothetical protein    F7308_RS10380   FSC454_RS10085      
N0.HOG0000014   14.1    hypothetical protein    F7308_RS10385   FSC454_RS10090  CGC43_RS09225   
N0.HOG0000016   16.0    lipid A export permease/ATP-binding protein MsbA    F7308_RS03050   FSC454_RS08395  CGC43_RS07930   
N0.HOG0000018   18.0    lysophospholipid acyltransferase family protein F7308_RS09625   FSC454_RS09200  CGC43_RS08785   
N0.HOG0000019   19.0    oligopeptide:H+ symporter   F7308_RS05250   FSC454_RS03585  CGC43_RS02785   
N0.HOG0000019   19.1    oligopeptide:H+ symporter   F7308_RS06680   FSC454_RS05345      
N0.HOG0000021   21.0    DUF3568 family protein  F7308_RS09380   FSC454_RS09105  CGC43_RS08485   
N0.HOG0000034   34.0    adenosylmethionine--8-amino-7-oxononanoate transaminase F7308_RS06600   FSC454_RS05265  CGC43_RS04285   
N0.HOG0000036   36.0    hypothetical protein        FSC454_RS01460  CGC43_RS07380   
N0.HOG0000041   41.0    site-specific integrase F7308_RS02895   FSC454_RS07130      
N0.HOG0000042   42.0    beta-ketoacyl-ACP synthase II   F7308_RS04900   FSC454_RS03230  CGC43_RS02285   
N0.HOG0000047   47.0    linear amide C-N hydrolase  F7308_RS05335   FSC454_RS03700  CGC43_RS02870   
N0.HOG0000047   47.1    linear amide C-N hydrolase  F7308_RS07330   FSC454_RS04815      
N0.HOG0000052   52.0    pyridoxal phosphate-dependent aminotransferase  F7308_RS06130   FSC454_RS04485  CGC43_RS05500   
N0.HOG0000052   52.1    pyridoxal phosphate-dependent aminotransferase  F7308_RS09335   FSC454_RS09060      
N0.HOG0000063   63.0    LysR family transcriptional regulator   F7308_RS05745   FSC454_RS04215  CGC43_RS03300   
N0.HOG0000064   64.0    DUF3573 domain-containing protein       FSC454_RS02090, FSC454_RS02095  CGC43_RS06810   
N0.HOG0000065   65.0    IS5 family transposase      FSC454_RS09255  CGC43_RS08730   
N0.HOG0000067   67.0    glycosyl hydrolase family 18 protein    F7308_RS02810   FSC454_RS08580  CGC43_RS08115   
N0.HOG0000068   68.0    DUF3568 family protein  F7308_RS00145   FSC454_RS00125      
N0.HOG0000068   68.1    DUF3568 family protein  F7308_RS04665   FSC454_RS07225  CGC43_RS02075   
N0.HOG0000070   70.0    phosphatase PAP2 family protein     FSC454_RS06215  CGC43_RS03495   
N0.HOG0000073   73.0    restriction endonuclease subunit S  F7308_RS01485       CGC43_RS07515   
N0.HOG0000074   74.0    pilin   F7308_RS02110   FSC454_RS01950  CGC43_RS06950   
N0.HOG0000075   75.0    LysR substrate-binding domain-containing protein    F7308_RS02005   FSC454_RS01835  CGC43_RS07090   
N0.HOG0000076   76.0    DNA primase phage associated    F7308_RS02915   FSC454_RS09945, FSC454_RS09950      
N0.HOG0000076   76.1    toprim domain-containing protein    F7308_RS04495   FSC454_RS07110      
N0.HOG0000077   77.0    LysR family transcriptional regulator   F7308_RS02945   FSC454_RS08495  CGC43_RS08030   
N0.HOG0000078   78.0    efflux RND transporter periplasmic adaptor subunit  F7308_RS03035   FSC454_RS08410  CGC43_RS07945   
N0.HOG0000082   82.0    SDR family oxidoreductase   F7308_RS05750   FSC454_RS04220  CGC43_RS03305   
N0.HOG0000083   83.0    3-oxoacyl-ACP reductase FabG    F7308_RS04910   FSC454_RS03240  CGC43_RS02295   
N0.HOG0000084   84.0    diaminopimelate decarboxylase   F7308_RS03550   FSC454_RS07985  CGC43_RS07580   
N0.HOG0000085   85.0    glycosyltransferase family 2 protein    F7308_RS04565   FSC454_RS07330  CGC43_RS02000   
N0.HOG0000086   86.0    glycosyltransferase family 2 protein        FSC454_RS03915  CGC43_RS03130   
N0.HOG0000095   95.0    outer membrane beta-barrel protein  F7308_RS10125   FSC454_RS09920  CGC43_RS09175   
N0.HOG0000096   96.0    class A beta-lactamase  F7308_RS06050   FSC454_RS06185      
N0.HOG0000096   96.1    class A beta-lactamase  F7308_RS07675   FSC454_RS06710  CGC43_RS05145   
N0.HOG0000097   97.0    ATP-binding cassette domain-containing protein  F7308_RS08020   FSC454_RS06375  CGC43_RS05645   
N0.HOG0000103   103.0   acyl carrier protein    F7308_RS04905   FSC454_RS03235  CGC43_RS02290   
N0.HOG0000105   105.0   aspartate carbamoyltransferase  F7308_RS00095   FSC454_RS00070  CGC43_RS00050   
N0.HOG0000106   106.0   hypothetical protein    F7308_RS00480   FSC454_RS00460      
N0.HOG0000107   107.0   YoaK family protein F7308_RS00410   FSC454_RS00370  CGC43_RS00325   
N0.HOG0000108   108.0   aromatic amino acid transport family protein    F7308_RS00415   FSC454_RS00375  CGC43_RS00330   
N0.HOG0000109   109.0   NAD-dependent succinate-semialdehyde dehydrogenase  F7308_RS00570   FSC454_RS00555  CGC43_RS00465   
N0.HOG0000110   110.0   ABC transporter permease subunit    F7308_RS00645   FSC454_RS00630  CGC43_RS00520   
N0.HOG0000111   111.0   amidohydrolase family protein       FSC454_RS06805  CGC43_RS05235   
N0.HOG0000112   112.0   NAD(P)H:quinone oxidoreductase  F7308_RS00875   FSC454_RS00865  CGC43_RS00795   
N0.HOG0000113   113.0   alpha-hydroxy acid oxidase  F7308_RS01025   FSC454_RS01035      
N0.HOG0000114   114.0   cysteine synthase family protein    F7308_RS05090   FSC454_RS03445  CGC43_RS02475   
N0.HOG0000115   115.0   class II fumarate hydratase F7308_RS01040   FSC454_RS01050  CGC43_RS01015   
N0.HOG0000118   118.0   YciI family protein F7308_RS04895   FSC454_RS03225  CGC43_RS02280   
N0.HOG0000119   119.0   site-specific tyrosine recombinase XerD F7308_RS03265   FSC454_RS08140  CGC43_RS07715   
N0.HOG0000121   121.0   SulP family inorganic anion transporter F7308_RS08435   FSC454_RS02915      
N0.HOG0000122   122.0   hypothetical protein    F7308_RS01640   FSC454_RS01455  CGC43_RS07385   
N0.HOG0000123   123.0   alanine racemase    F7308_RS08130   FSC454_RS04715  CGC43_RS05750   
N0.HOG0000124   124.0   APC family permease F7308_RS06770   FSC454_RS05440  CGC43_RS04415   
N0.HOG0000125   125.0   ATP-binding cassette domain-containing protein  F7308_RS01740   FSC454_RS01545  CGC43_RS07305   
N0.HOG0000126   126.0   sulfite exporter TauE/SafE family protein   F7308_RS01940   FSC454_RS01765  CGC43_RS07145   
N0.HOG0000128   128.0   S-(hydroxymethyl)glutathione dehydrogenase/class III alcohol dehydrogenase  F7308_RS02085   FSC454_RS01920  CGC43_RS06975   
N0.HOG0000129   129.0   PepSY domain-containing protein F7308_RS02950   FSC454_RS08490  CGC43_RS08025   
N0.HOG0000130   130.0   helix-turn-helix transcriptional regulator  F7308_RS05240   FSC454_RS03575  CGC43_RS02775   
N0.HOG0000131   131.0   MFS transporter F7308_RS08460   FSC454_RS02890  CGC43_RS06030   
N0.HOG0000132   132.0   amidophosphoribosyltransferase  F7308_RS02510   FSC454_RS08925  CGC43_RS08360   
N0.HOG0000133   133.0   hypothetical protein    F7308_RS02535   FSC454_RS08895      
N0.HOG0000134   134.0   MFS transporter F7308_RS02830   FSC454_RS08560  CGC43_RS08095   
N0.HOG0000135   135.0   hypothetical protein    F7308_RS02910   FSC454_RS10095, FSC454_RS03065      
N0.HOG0000137   137.0   efflux RND transporter permease subunit F7308_RS03030   FSC454_RS08415  CGC43_RS07950   
N0.HOG0000140   140.0   glycine C-acetyltransferase F7308_RS08550   FSC454_RS02835  CGC43_RS02740   
N0.HOG0000143   143.0   DegT/DnrJ/EryC1/StrS family aminotransferase    F7308_RS04285       CGC43_RS01845   
N0.HOG0000144   144.0   glucosyltransferase domain-containing protein       FSC454_RS08120  CGC43_RS07705   
N0.HOG0000145   145.0   bifunctional UDP-N-acetylglucosamine diphosphorylase/glucosamine-1-phosphate N-acetyltransferase GlmU   F7308_RS09270   FSC454_RS02260  CGC43_RS06610   
N0.HOG0000146   146.0   ATP-binding cassette domain-containing protein  F7308_RS08445   FSC454_RS02905  CGC43_RS06015   
N0.HOG0000147   147.0   glycosyltransferase family 1 protein    F7308_RS05575   FSC454_RS03925  CGC43_RS03140   
N0.HOG0000148   148.0   MFS transporter F7308_RS03025       CGC43_RS07955   
N0.HOG0000150   150.0   DotU family type IV/VI secretion system protein F7308_RS05015   FSC454_RS03355  CGC43_RS02400   
N0.HOG0000151   151.0   type VI secretion system baseplate subunit TssF/IglH    F7308_RS05020   FSC454_RS03360  CGC43_RS02405   
N0.HOG0000152   152.0   type VI secretion system lipoprotein IglE   F7308_RS05040   FSC454_RS03380  CGC43_RS02425   
N0.HOG0000153   153.0   DEAD/DEAH box helicase  F7308_RS06675   FSC454_RS05340  CGC43_RS04355   
N0.HOG0000154   154.0   KpsF/GutQ family sugar-phosphate isomerase  F7308_RS05520   FSC454_RS03845  CGC43_RS03025   
N0.HOG0000155   155.0   MFS transporter F7308_RS05630   FSC454_RS04100  CGC43_RS03190   
N0.HOG0000156   156.0   nuclease-related domain-containing protein  F7308_RS05770   FSC454_RS04235  CGC43_RS03330   
N0.HOG0000158   158.0   hypothetical protein    F7308_RS05820, F7308_RS05825    FSC454_RS04280      
N0.HOG0000159   159.0   sugar porter family MFS transporter F7308_RS07425   FSC454_RS06865      
N0.HOG0000160   160.0   LysR substrate-binding domain-containing protein    F7308_RS06420   FSC454_RS05040  CGC43_RS03825   
N0.HOG0000161   161.0   FUSC family protein F7308_RS00430   FSC454_RS00390  CGC43_RS00345   
N0.HOG0000164   164.0   alpha/beta hydrolase fold domain-containing protein F7308_RS07040   FSC454_RS05795  CGC43_RS04020   
N0.HOG0000165   165.0   bifunctional methionine sulfoxide reductase B/A protein F7308_RS07290   FSC454_RS04835  CGC43_RS04635   
N0.HOG0000166   166.0   cation:proton antiporter    F7308_RS07730   FSC454_RS06765  CGC43_RS05200   
N0.HOG0000167   167.0   OmpA family protein F7308_RS08075   FSC454_RS04765  CGC43_RS05700   
N0.HOG0000168   168.0   SprT family zinc-dependent metalloprotease  F7308_RS08110   FSC454_RS04730  CGC43_RS05735   
N0.HOG0000169   169.0   APC family permease F7308_RS08125   FSC454_RS04720  CGC43_RS05745   
N0.HOG0000171   171.0   preprotein translocase subunit SecA F7308_RS08215   FSC454_RS06955  CGC43_RS05830   
N0.HOG0000172   172.0   prepilin-type N-terminal cleavage/methylation domain-containing protein F7308_RS08250   FSC454_RS06995  CGC43_RS05870   
N0.HOG0000173   173.0   L-threonine 3-dehydrogenase F7308_RS08555   FSC454_RS02830  CGC43_RS02735   
N0.HOG0000174   174.0   helix-turn-helix domain-containing protein  F7308_RS08840   FSC454_RS02545  CGC43_RS06195   
N0.HOG0000175   175.0   Na+/H+ antiporter NhaA  F7308_RS09635   FSC454_RS09210  CGC43_RS08800   
N0.HOG0000176   176.0   ion channel F7308_RS07655   FSC454_RS06690  CGC43_RS05125   
N0.HOG0000177   177.0   prepilin-type N-terminal cleavage/methylation domain-containing protein F7308_RS01620   FSC454_RS01430  CGC43_RS07410   
N0.HOG0000178   178.0   hypothetical protein    F7308_RS01625   FSC454_RS01435  CGC43_RS07405   
N0.HOG0000179   179.0   ATP-binding protein     FSC454_RS10030  CGC43_RS01425   
N0.HOG0000180   180.0   hypothetical protein        FSC454_RS09025, FSC454_RS09030  CGC43_RS08450   
N0.HOG0000181   181.0   ATP-grasp domain-containing protein     FSC454_RS07885  CGC43_RS01370   
N0.HOG0000184   184.0   isochorismatase family protein      FSC454_RS06785  CGC43_RS05215   
N0.HOG0000226   226.0   transglutaminase family protein F7308_RS00165   FSC454_RS00145      
N0.HOG0000337   337.0   putative basic amino acid antiporter YfcC   F7308_RS01330   FSC454_RS01285      
N0.HOG0000371   371.0   lytic polysaccharide monooxygenase  F7308_RS05680   FSC454_RS04150      
N0.HOG0000383   383.0   type I restriction endonuclease subunit R   F7308_RS01500       CGC43_RS07490   
N0.HOG0000499   499.0   lysine-sensitive aspartokinase 3    F7308_RS09360   FSC454_RS09085      
N0.HOG0000544   544.0   FAD-dependent oxidoreductase    F7308_RS02980   FSC454_RS08460      
N0.HOG0000860   860.0   hypothetical protein    F7308_RS10205   FSC454_RS04995      
N0.HOG0000920   920.0   GTP-binding protein F7308_RS06890   FSC454_RS05590      
N0.HOG0000923   923.0   TIM-barrel domain-containing protein    F7308_RS06940   FSC454_RS05680      
N0.HOG0000966   966.0   multidrug effflux MFS transporter   F7308_RS07285   FSC454_RS04840      
N0.HOG0001174   1174.0  aromatic amino acid transport family protein    F7308_RS08030   FSC454_RS06385      
N0.HOG0001211   1211.0  LysR family transcriptional regulator   F7308_RS00150   FSC454_RS00130      
N0.HOG0001262   1262.0  peptide MFS transporter F7308_RS08740   FSC454_RS02655      
N0.HOG0001318   1318.0  MFS transporter F7308_RS06935   FSC454_RS05675      
N0.HOG0001326   1326.0  N-acetyltransferase     FSC454_RS00705  CGC43_RS00575   
N0.HOG0001329   1329.0  arginine deiminase-related protein      FSC454_RS07550  CGC43_RS01685   
N0.HOG0001331   1331.0  MFS transporter     FSC454_RS06980  CGC43_RS05855   
N0.HOG0001332   1332.0  restriction endonuclease subunit S      FSC454_RS04440  CGC43_RS05540   
Number of HOGs refined: 117 for a total refinement of   124