tatsuhikonaito / DEEP-HLA


IndexError: positional indexers are out-of-bounds #3

Open freshfischer opened 3 years ago

freshfischer commented 3 years ago

Hello @tatsuhikonaito, it's very kind of you to create and share this algorithm. I ran into an error when applying train.py to my own reference panel and sample data in cross-validation. I'm not sure whether this error comes from the large sample size or from too many SNP sites. The detailed output is as follows:

```
time python $Software_Dir/train.py --ref ${ref} --sample $indir/$sample --model $model --hla $hla --model-dir $indir/$sample.model
Logging to training.log.
Training processes started at Thu May 13 20:19:27 2021.
Loading files...
10689 people loaded from reference.
29948 SNPs loaded from reference.
27956 SNPs loaded from sample.
27990 SNPs matched in position and used for training.
Traceback (most recent call last):
  File "/opt/anaconda3/lib/python3.7/site-packages/pandas/core/indexing.py", line 1469, in _get_list_axis
    return self.obj._take_with_is_copy(key, axis=axis)
  File "/opt/anaconda3/lib/python3.7/site-packages/pandas/core/generic.py", line 3363, in _take_with_is_copy
    result = self.take(indices=indices, axis=axis)
  File "/opt/anaconda3/lib/python3.7/site-packages/pandas/core/generic.py", line 3351, in take
    indices, axis=self._get_block_manager_axis(axis), verify=True
  File "/opt/anaconda3/lib/python3.7/site-packages/pandas/core/internals/managers.py", line 1440, in take
    indexer = maybe_convert_indices(indexer, n)
  File "/opt/anaconda3/lib/python3.7/site-packages/pandas/core/indexers.py", line 250, in maybe_convert_indices
    raise IndexError("indices are out-of-bounds")
IndexError: indices are out-of-bounds

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/dengcm/soft/DEEP-HLA//train.py", line 375, in <module>
    main()
  File "/home/dengcm/soft/DEEP-HLA//train.py", line 371, in main
    train(args)
  File "/home/dengcm/soft/DEEP-HLA//train.py", line 189, in train
    ref_concord_phased = ref_phased.iloc[np.where(concord_snp)[0]]
  File "/opt/anaconda3/lib/python3.7/site-packages/pandas/core/indexing.py", line 879, in __getitem__
    return self._getitem_axis(maybe_callable, axis=axis)
  File "/opt/anaconda3/lib/python3.7/site-packages/pandas/core/indexing.py", line 1487, in _getitem_axis
    return self._get_list_axis(key, axis=axis)
  File "/opt/anaconda3/lib/python3.7/site-packages/pandas/core/indexing.py", line 1472, in _get_list_axis
    raise IndexError("positional indexers are out-of-bounds") from err
IndexError: positional indexers are out-of-bounds
```

Looking forward to your kind reply!

tatsuhikonaito commented 3 years ago

Hi, @freshfischer. Thank you for your interest in our tool! It seems there may be a mismatch in the number of SNPs between the reference bim file and the reference bgl.phased file. Could you check that? Another thing that needs to be fixed is that the number of matched SNPs (27990) exceeds the number of SNPs in the sample (27956). This seems to happen because some of the SNPs in the sample file coincide in position with some of the HLA alleles. DEEP*HLA aligns the SNPs by position to save the trouble of matching SNP names between the reference and the sample, but this seems to have backfired here. We may revise the script to avoid this error in the future; tentatively, could you please use `train.align_by_name.py` and `impute.align_by_name.py` instead? I hope this will work for cross-validation, since the SNP names definitely match between the sample and the reference. Please let me know if you get any errors.
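The failure mode described above can be reproduced in isolation: the line `ref_phased.iloc[np.where(concord_snp)[0]]` from the traceback fails whenever the concordance mask was built against a longer marker list than the phased DataFrame actually has rows. A minimal sketch with toy data (not the real DEEP*HLA files):

```python
import numpy as np
import pandas as pd

# Phased reference with only 3 marker rows (stand-in for bgl.phased).
ref_phased = pd.DataFrame({"hap1": ["A", "T", "G"], "hap2": ["A", "C", "G"]})

# Concordance mask built against a *longer* marker list (e.g. a bim file
# with 5 variants) -- positions 3 and 4 do not exist in ref_phased.
concord_snp = np.array([True, False, True, True, True])

try:
    ref_phased.iloc[np.where(concord_snp)[0]]
except IndexError as e:
    print(e)  # "positional indexers are out-of-bounds"
```

So any discrepancy between the bim row count and the bgl.phased marker count (or spurious position matches inflating the mask, as described above) produces exactly this exception.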

freshfischer commented 3 years ago

Hi @tatsuhikonaito, thank you for the updated script. I checked the number of SNPs in the reference bim file and the reference bgl.phased file; both contain 29948 SNP sites. I then reran training with `train.align_by_name.py`, and the error still comes out, so it seems not to be caused only by overlapping positions between SNPs and HLA alleles. The detailed error is as follows:

```
python $Software_Dir/train.align_by_name.py --ref $ref --sample $indir/$sample --model $model --hla $hla --model-dir $indir/$sample.model
Logging to training.log.
Training processes started at Thu May 13 23:34:45 2021.
Loading files...
10689 people loaded from reference.
29948 SNPs loaded from reference.
27956 SNPs loaded from sample.
27956 SNPs matched in name and used for training.
Traceback (most recent call last):
  File "/opt/anaconda3/lib/python3.7/site-packages/pandas/core/indexing.py", line 1469, in _get_list_axis
    return self.obj._take_with_is_copy(key, axis=axis)
  File "/opt/anaconda3/lib/python3.7/site-packages/pandas/core/generic.py", line 3363, in _take_with_is_copy
    result = self.take(indices=indices, axis=axis)
  File "/opt/anaconda3/lib/python3.7/site-packages/pandas/core/generic.py", line 3351, in take
    indices, axis=self._get_block_manager_axis(axis), verify=True
  File "/opt/anaconda3/lib/python3.7/site-packages/pandas/core/internals/managers.py", line 1440, in take
    indexer = maybe_convert_indices(indexer, n)
  File "/opt/anaconda3/lib/python3.7/site-packages/pandas/core/indexers.py", line 250, in maybe_convert_indices
    raise IndexError("indices are out-of-bounds")
IndexError: indices are out-of-bounds

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/dengcm/soft/DEEP-HLA//train.align_by_name.py", line 375, in <module>
    main()
  File "/home/dengcm/soft/DEEP-HLA//train.align_by_name.py", line 371, in main
    train(args)
  File "/home/dengcm/soft/DEEP-HLA//train.align_by_name.py", line 189, in train
    ref_concord_phased = ref_phased.iloc[np.where(concord_snp)[0]]
  File "/opt/anaconda3/lib/python3.7/site-packages/pandas/core/indexing.py", line 879, in __getitem__
    return self._getitem_axis(maybe_callable, axis=axis)
  File "/opt/anaconda3/lib/python3.7/site-packages/pandas/core/indexing.py", line 1487, in _getitem_axis
    return self._get_list_axis(key, axis=axis)
  File "/opt/anaconda3/lib/python3.7/site-packages/pandas/core/indexing.py", line 1472, in _get_list_axis
    raise IndexError("positional indexers are out-of-bounds") from err
IndexError: positional indexers are out-of-bounds
```

Looking forward to your kind reply!

tatsuhikonaito commented 3 years ago

Hi, @freshfischer. I am sorry for the late reply. This error suggests that the reference file may be incompatible with our script. Could you print the shapes of the reference bim and bgl.phased data by adding the following code at line 166 of `train.align_by_name.py`?

```
print(ref_bim.shape)
print(ref_phased.shape)
```

Thank you in advance for trying this.
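The shape check above amounts to verifying that the bim file and the bgl.phased file describe the same set of markers. A standalone sketch of that sanity check, under the assumption that the panel follows the Beagle v3 layout where each marker row begins with `M` (verify this against Pan-Asian-REF.bgl.phased):

```python
import io

def count_bgl_markers(lines):
    """Count marker rows in a Beagle v3 phased file (rows starting with 'M ')."""
    return sum(1 for line in lines if line.startswith("M "))

# Toy phased file: two header rows and three marker rows.
toy_bgl = io.StringIO(
    "I id s1 s1 s2 s2\n"
    "P pheno 0 0 0 0\n"
    "M rs1 A A A T\n"
    "M rs2 C C G G\n"
    "M rs3 T T T T\n"
)
n_bim_rows = 3  # stand-in for the row count of the matching .bim file
assert count_bgl_markers(toy_bgl) == n_bim_rows
print("marker counts match")
```

On real files you would replace `toy_bgl` with `open("your_ref.bgl.phased")` and `n_bim_rows` with the bim line count; a mismatch here would explain the out-of-bounds indexer.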

freshfischer commented 3 years ago

Thanks a lot for your kind reply, @tatsuhikonaito. I rechecked the input .bim and .bgl.phased files of my reference panel and found that the format of the first 5 rows of my *.bgl.phased differed from Pan-Asian-REF.bgl.phased. Training worked after I changed the format to match Pan-Asian-REF.bgl.phased, but a new error still came out:

```
[Epoch 7]: 100%|█████████████████████████████████████████████| 318/318 [12:04<00:00, 1.89s/it]
HLA_C training accuracy: 0.9902511239051819
HLA_B training accuracy: 0.9786311984062195
Average training accuracy: 0.9844411611557007
HLA_C validation accuracy: 0.9897003769874573
HLA_B validation accuracy: 0.9728464484214783
Average validation accuracy: 0.9812734127044678
[Epoch 8]:   4%|█████                                        | 12/318 [00:28<12:18, 2.41s/it]
Killed

real    743m52.352s
user    17087m38.958s
sys     791m50.107s
```

The model trains well on a smaller sample size (N=100). Does the algorithm not work for a large sample size (N>10000) with many SNP sites (N>20000), or is this a limit of my machine's capacity? Looking forward to your kind reply!

tatsuhikonaito commented 3 years ago

Hi, @freshfischer. Given that training works for a smaller sample size but stops partway through for a larger one, I suspect this error is due to the memory limit of your machine (a bare `Killed` message typically means the operating system's out-of-memory killer terminated the process). We may update our tool to support parallel computing in the future, but at this point there seems to be no way to solve this problem except to try a machine with more memory. I'm sorry.