lovemun / SEXCMD

5 stars 5 forks source link

marker file creation issue #2

Open npsonis opened 5 years ago

npsonis commented 5 years ago

I am trying to make my own marker file for humans (since the software does not provide the one used in the paper).

In step 1 I got the following error

Traceback (most recent call last): File "../../../aDNA_programs/SEXCMD/1.extract_chr_from_REF.py", line 28, in <m outfile.write('>' + CHR + '\n') I am using python 2.7 I tried different reference genomes (from NCBI, 1000K, UCSN) and builds 37/38 but did not worked. Is there a solution for this?

In order to move forward I downloaded the chromosomes X and Y from NCBI, and changed the names of the fasta to CHRX and CHRY (because I think the script in step 1 was also doing this).

Then, in step 2 I met an allocation limit problem with lastz:

FAILURE: in add_segment() table size (4,869,542,152 for 101,448,794 segments) exceeds allocation limit of 4,294,967,279; consider raising scoring threshold (--hspthresh or --exact) or breaking your target sequence into smaller pieces

At first I tried the lastz_32 (more RAM memory) instead of lastz. Gave me the same error and also it was too slow.

Then I added in the 2nd command the argument --hspthresh=4000 (3000 is the default) It return the same error I gradually (per 1000) increased the value and at 10000 it did not gave me an error but it stack for two days. It did the same with 20000.

May I have some help with these issues?

Thank you in advance!

lovemun commented 5 years ago

In step1, the input file of SEXCMD is required a fasta file format which starts '>chr1'. I think the header of your fasta file is included more information, such as AC, gi, LN, rl, M5, etc.

We test SEXCMD to the fasta file from UCSC genome browser. Please download the fasta file from UCSC genome browser and re-try SEXCMD to the fasta file. If you have the same problem, please leave the message in this issues tab.

npsonis commented 5 years ago

I downloaded the hg38.fa.gz from UCSC and gunzipped it. (should I have indexed the genome, e.g. with BWA?)

Then I typed: python2 mypath/SEXCMD/1.extract_chr_from_REF.py mypath/UCSC/hg38/hg38.fa Y

It returned again the same issue

Traceback (most recent call last): File "mypath/SEXCMD/1.extract_chr_from_REF.py", line 28, in outfile.write('>' + CHR + '\n') NameError: name 'outfile' is not defined

npsonis commented 5 years ago

Hi lovemun,

May I have a feedback please.

Best,

Nikos

npsonis commented 5 years ago

Hi!

I managed to properly create the marker file and perform the sex determination by using the following modifications:

In step 1 I made the following modifications in 1.extract_chr_from_REF.py changed third "if" statement to "elif" changed "args[1][(len(args[1])-3)" to "args[1][(len(args[1])-2)" (fa == fa and not fa. == fa)

In step 2 in the lastz command I used as input the output of the first commpand which is hg38_chrX.fasta and not hg38.chrX.fasta.

In step 3 Since the are is no instruction on how to generated the gtf file: I went to UCSC Table Browser, http://genome.ucsc.edu/cgi-bin/hgTables and selected Clade : Mammal Genome : Human Assembly : latest one (Currently GRCh38/hg38) Group : Genes and Gene Predictions Track : RefSeq Gene (I used NCBI RefSeq) Table : refGene (I used UCSC ReSeq (refGene)) region : Check genome Output fromat : GTF - gene transfer format (limited) Output file : hg38.ucsc.gtf Pressed Get output

Finally I modified the SEXCMD.R script as follows: changed "if (sextype = "XY")" to "if (sextype == "XY")" (appears in two different lines) changed "if (sextype = "ZW")" to "if (sextype == "ZW")" (appears in two different lines)

Please let me know that the modifications were properly made.

Best,

Nikos

nbbarrientos commented 4 years ago

Hi!

I managed to properly create the marker file and perform the sex determination by using the following modifications:

In step 1 I made the following modifications in 1.extract_chr_from_REF.py changed third "if" statement to "elif" changed "args[1][(len(args[1])-3)" to "args[1][(len(args[1])-2)" (fa == fa and not fa. == fa)

In step 2 in the lastz command I used as input the output of the first commpand which is hg38_chrX.fasta and not hg38.chrX.fasta.

In step 3 Since the are is no instruction on how to generated the gtf file: I went to UCSC Table Browser, http://genome.ucsc.edu/cgi-bin/hgTables and selected Clade : Mammal Genome : Human Assembly : latest one (Currently GRCh38/hg38) Group : Genes and Gene Predictions Track : RefSeq Gene (I used NCBI RefSeq) Table : refGene (I used UCSC ReSeq (refGene)) region : Check genome Output fromat : GTF - gene transfer format (limited) Output file : hg38.ucsc.gtf Pressed Get output

Finally I modified the SEXCMD.R script as follows: changed "if (sextype = "XY")" to "if (sextype == "XY")" (appears in two different lines) changed "if (sextype = "ZW")" to "if (sextype == "ZW")" (appears in two different lines)

Please let me know that the modifications were properly made.

Best,

Nikos

Hi Nikos,

I've had the same issues that you described in your original post. I made the changes that you suggested but I am still having a problem in the first step. In line 29 of the code, I keep getting this error:

Traceback (most recent call last): File "PATH/SEXCMD/1.extract_chr_from_REF.py", line 29, in for idx in xrange(0, len(chr_dict[CHR]), 50): KeyError: 'chrX'

I was wondering if you encountered this and how you solved it. Thank you very much.

Best,

Nelson

lovemun commented 4 years ago

I downloaded the hg38.fa.gz from UCSC and gunzipped it. (should I have indexed the genome, e.g. with BWA?)

Then I typed: python2 mypath/SEXCMD/1.extract_chr_from_REF.py mypath/UCSC/hg38/hg38.fa Y

It returned again the same issue

Traceback (most recent call last): File "mypath/SEXCMD/1.extract_chr_from_REF.py", line 28, in outfile.write('>' + CHR + '\n') NameError: name 'outfile' is not defined

I adjust the script 1.extract_chr_from_REF.py file. Please re-try.