pauline-ng / SIFT4G_Create_Genomic_DB

Create genomic databases with SIFT predictions. Input is an organism's genomic DNA (.fa) file and the gene annotation file (.gtf). Output will be a database that can be used with SIFT4G_Annotator.jar to annotate VCF files.
GNU General Public License v3.0

What next after seeing X characters in my all_prot.fasta #19

Closed: eyeamnice closed this issue 4 years ago

eyeamnice commented 4 years ago

Hi SIFT team,

I ran cat all_prot.fasta | grep -v ">" | grep X and got output containing long runs of X, as shown below:

MGLLSFVFGGLGFILIGAHEALLHSSPSSQNKKTKTLFSISLVLFSSFFILNSTLSLFDAHSSNDAVGAALQLQVLSIAFVFLFYSLLPLLSLSFTLPSPLLNLVGAFAFAEEFLLFYLXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXGDDHTPGG
MVEDGDEVDSMSAETARAIVGHGGVRPLVALCQTGDXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXKEHAAECLQNLTASNENLRKSVISEGGVRSLLAYLDGPLPQESAVGALRNLVGLVPE
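Because grep -v ">" discards the header lines, the output above does not say which records contain the X residues. A minimal Python sketch that reports them by header is given here. It assumes all_prot.fasta is an ordinary FASTA file in which a sequence may span several lines; the function names are hypothetical and not part of the SIFT4G scripts.

```python
import re


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def records_with_x(lines, min_run=1):
    """Return (header, number_of_X_runs, longest_run) for each affected record.

    A run is a stretch of at least `min_run` consecutive X characters.
    """
    pattern = re.compile("X{%d,}" % min_run)
    hits = []
    for header, seq in read_fasta(lines):
        runs = [m.end() - m.start() for m in pattern.finditer(seq)]
        if runs:
            hits.append((header, len(runs), max(runs)))
    return hits
```

To use it, pass an open file handle, for example records_with_x(open("all_prot.fasta")). Each returned header identifies a record whose sequence contains X.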

What should I do next in order to continue building the database? It seems @pauline-ng previously mentioned that this cat command should not print any output. Do I have to start from the beginning? If so, will the files already created be overwritten?

I am asking because it took a long time to reach this stage. If not, how do I restart from where I am currently? I am stuck at:

** Checking query data and substitutions files **
* processing queries: 100.00/100.00% *

** Searching database for candidate sequences **
* processing database part 200 (size ~0.25 GB): 0.00/100.00% * *

I also wonder how many parts there are with uniref90. I divided the size of uniref90.fasta (in GB) by 0.25 and got 188, but my sift4g run, shown above, was processing part 200. What is the relationship there? I had to run this for days to get to this point.

pauline-ng commented 4 years ago

Once you have all_prot.fasta made, you don't need to rerun lines 50-106 in make-SIFT-db-all.pl. Just delete lines 50-106, and then run it.

Generating the SIFT predictions from SIFT 4G should be relatively fast; it's the pre-processing that is computationally expensive.

If it still takes a long time (more than 3 days), please check that there aren't duplicates in all_prot.fasta (grep ">" all_prot.fasta | sort | uniq -d). Duplicates would appear if you reran make-SIFT-db-all.pl without deleting the contents of the working folder, and they would make SIFT take longer to run.
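The duplicate check above can also be sketched in Python, which additionally reports how many times each header occurs. This is only a rough equivalent of the grep/sort/uniq pipeline: it counts lines that start with ">" rather than lines that merely contain it, and the function name is hypothetical.

```python
from collections import Counter


def duplicate_headers(lines):
    """Return {header: count} for every FASTA header seen more than once."""
    counts = Counter(
        line.strip() for line in lines if line.startswith(">")
    )
    return {header: n for header, n in counts.items() if n > 1}
```

Calling duplicate_headers(open("all_prot.fasta")) should return an empty dict when every header is unique.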

eyeamnice commented 4 years ago

Thank you Pauline. Should the compressed file for regions in the output directory contain SIFT scores? I looked at the results with headers:

#Position       Ref_allele      New_allele      Transcript_id   Gene_id Gene_name       Region  Ref_amino_acid  New_amino_acid  Position_of_amino_acid_substitution     SIFT_score      SIFT_median_sequence_info       Num_seqs_at_position    dbSNP_id

I see that the columns Position_of_amino_acid_substitution, SIFT_score and SIFT_median_sequence_info all contain NA. Is that normal, or is something wrong? I only have 1.gz and 1.regions at the moment.
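Rather than judging by eye, one way to tell whether every row lacks a score or only some rows do is to count them. The sketch below assumes the .gz file is gzip-compressed, tab-delimited text whose first line is the header shown above. That layout is an assumption, not something confirmed in this thread, and the function name is hypothetical.

```python
import gzip


def sift_score_summary(path, column="SIFT_score"):
    """Return (total_rows, rows_with_NA) for `column` in a gzipped TSV."""
    with gzip.open(path, "rt") as fh:
        # Assumed: the first line is the header, possibly starting with '#'.
        header = fh.readline().rstrip("\n").lstrip("#").split("\t")
        idx = header.index(column)
        total = na = 0
        for line in fh:
            fields = line.rstrip("\n").split("\t")
            if len(fields) <= idx:
                continue  # skip short or malformed rows
            total += 1
            if fields[idx] == "NA":
                na += 1
    return total, na
```

For example, sift_score_summary("1.gz") returns the number of data rows and how many of them have NA in the SIFT_score column.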

Also, which contents do I need to delete? I followed the exact directory naming structure given in the SIFT example.

pauline-ng commented 4 years ago

Hi,

Go to https://github.com/pauline-ng/SIFT4G_Create_Genomic_DB and scroll to the bottom, where it says "Monitoring the Database Creation Process".

Follow the commands under the 2nd column "Check the following." If any folders are empty, that will help pinpoint the problem.

eyeamnice commented 4 years ago

Hi Pauline, I followed your directions and checked the database creation process. I think it was going well, but it seems to have run out of space at this stage, with the error:

sort: write failed: /tmp/sort3Ipoki: No space left on device
Traceback (most recent call last):
  File "make_regions_file.py", line 68, in <module>
    get_regions (chrom_file, out_file)
  File "make_regions_file.py", line 31, in get_regions
    pos = get_pos (first_line)
  File "make_regions_file.py", line 8, in get_pos
    return int (fields[0])
ValueError: invalid literal for int() with base 10: ''
sort: write failed: /tmp/sortBthHCE: No space left on device
Traceback (most recent call last):
  File "make_regions_file.py", line 68, in <module>
    get_regions (chrom_file, out_file)
  File "make_regions_file.py", line 31, in get_regions
    pos = get_pos (first_line)
  File "make_regions_file.py", line 8, in get_pos
    return int (fields[0])
ValueError: invalid literal for int() with base 10: ''
sort: write failed: /tmp/sortQN1GPd: No space left on device
Traceback (most recent call last):
  File "make_regions_file.py", line 68, in <module>
    get_regions (chrom_file, out_file)
  File "make_regions_file.py", line 31, in get_regions
    pos = get_pos (first_line)
  File "make_regions_file.py", line 8, in get_pos
    return int (fields[0])
ValueError: invalid literal for int() with base 10: ''
sort: write failed: /tmp/sortnpz5qQ: No space left on device
Traceback (most recent call last):
  File "make_regions_file.py", line 68, in <module>
    get_regions (chrom_file, out_file)
  File "make_regions_file.py", line 31, in get_regions
    pos = get_pos (first_line)
  File "make_regions_file.py", line 8, in get_pos
    return int (fields[0])
ValueError: invalid literal for int() with base 10: ''
cat: /parent_dir/singleRecords//KZ847133.singleRecords: No such file or directory
can't open /parent_dir/singleRecords//KZ847133.singleRecords at map-scores-back-to-records.pl line 122.
Unable to read from /parent_dir/singleRecords_with_scores/KZ847133_scores.Srecords

What do you recommend I do? It did go on to zip up chr-src and printed Done at the end, though.

pauline-ng commented 4 years ago

You have no space left on your device. Please free up some GB.
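One plausible reading of the log above is that sort failed while writing its temporary files to /tmp, so make_regions_file.py was left with empty input. In Python, int('') raises exactly the ValueError shown. The missing singleRecords file may be a later symptom of the same failure, so the final Done message does not by itself mean the run succeeded. Before rerunning, free space can be checked with a short sketch like the one below. The function names and the 10 GiB default threshold are arbitrary illustrations, not requirements stated by SIFT4G.

```python
import os
import shutil
import tempfile


def free_gib(path):
    """Return free space, in GiB, on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free / 2**30


def low_space(paths=None, min_free_gib=10.0):
    """Map each path with less than `min_free_gib` GiB free to its free space.

    By default this checks the temporary directory (tempfile honours the
    TMPDIR environment variable) and the current working directory.
    """
    if paths is None:
        paths = [tempfile.gettempdir(), os.getcwd()]
    report = {}
    for path in paths:
        free = free_gib(path)
        if free < min_free_gib:
            report[path] = free
    return report
```

GNU sort also places its temporary files in the directory named by TMPDIR, or in one given with its -T option, so if /tmp sits on a small partition, pointing TMPDIR at a larger disk before rerunning may help. That assumes the pipeline scripts pass the environment through to sort unchanged.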