rlichtenwalter / mRMR

A re-implementation of the minimum redundancy maximum relevance (mRMR) feature selection algorithm with emphasis on greatly increased performance (1000x or greater on large data sets) and an improved user interface.
GNU General Public License v3.0

error: integer overflow detected at line 2 column 2 ; contact author #2

Open saskra opened 4 years ago

saskra commented 4 years ago

https://github.com/rlichtenwalter/mRMR/blob/dd1015dce4c176fa0dab41dac49077539582b754/include/dataset.hpp#L120

What is the maximum size allowed?

saskra commented 4 years ago

And how many rows and columns are allowed, if that is the problem here?

$ bin/mrmr -t ',' -c 1 -d 'round' -v 2 top_proteins.csv 
2020-10-05 10:39:56 - FILE = top_proteins.csv
2020-10-05 10:39:56 - Reading and transforming dataset and computing attribute information...
2020-10-05 10:39:56 - Reading from file...
DONE (1.941491e-02 seconds)
2020-10-05 10:39:56 - Calculating mutual information between each attribute and class...Memory access error (memory dump written)

rlichtenwalter commented 3 years ago

Hi Saskra,

I'm sorry for the trouble, and I sincerely appreciate your perseverance. Are you willing and able to share the data file? I'll take a look at it personally and either resolve the issue with the code or let you know what's wrong with the data file.

To directly answer your question, there is no predefined limit on the number of rows or columns. The code is designed to handle extremely large data sets (if operating correctly). The integer overflow suggests that something is preventing proper compression of values that are being passed. It is worth noting that the discretization routines that are provided are basic. There is no provision for anything along the lines of rounded z-scoring that might otherwise be desirable, because that gets outside the scope of this code. If you're passing in large floating point values, they may easily overflow the representable integer values. Because this method (mRMR) is based on mutual information, it is inherently designed to operate on either categorical values or numerical splits, but numerical splitting isn't implemented here, so instead numerical values are binned, and those bins are expected to be of a fairly limited cardinality for method efficacy.
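As a rough illustration of the binning idea above, here is a minimal Python sketch that maps large floating point values onto a small number of non-negative integer bins before the file is handed to the tool. It uses plain equal-width binning, which is an assumption made for this example and not one of the discretization routines shipped with this code; the function name `equal_width_bins` and the sample values are hypothetical.

```python
def equal_width_bins(values, n_bins=10):
    """Map floats to integer bin indices in the range 0 .. n_bins - 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; put everything in bin 0.
        return [0] * len(values)
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Hypothetical column with large and negative floating point values.
column = [1234567.8, -42.5, 0.0, 98765.4, 500000.0]
print(equal_width_bins(column, n_bins=4))
```

Keeping `n_bins` small keeps the cardinality of each attribute limited, which is what a mutual-information-based method needs to work well.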

Best,

Ryan

saskra commented 3 years ago

Many thanks for your effort! Are there any restrictions regarding integer/floating point numbers, signed or unsigned, between 8 and 64 bit? I have already noticed one problem: the class labels apparently may only be 0 or 1, and the feature names must not be numbers.

I have discretized my data in two different ways and, for each, created a heavily truncated file in addition to the full-size one (see zip file attached). But the different error messages do not help me.

$ bin/mrmr example1_small.tsv 
2020-10-08 09:13:05 - No discretization method chosen. Default 'truncate' used...
Rank    Index   Name    Entropy Mutual Information  mRMR Score
0   0   Class   9.957275e-01    9.957275e-01    nan
1   1   f0  1.419556e+00    9.957275e-01    9.957275e-01
2   3   f2  1.548581e+00    6.222849e-01    0.000000e+00
3   2   f1  1.295738e+00    6.771344e-01    3.510671e-02
4   4   f3  1.314320e+00    7.265053e-02    -1.236555e-01
5   5   f4  1.140116e+00    9.768345e-02    -2.556513e-01
$ bin/mrmr example1_big.tsv 
2020-10-08 09:13:10 - No discretization method chosen. Default 'truncate' used...
Rank    Index   Name    Entropy Mutual Information  mRMR Score
0   0   Class   9.957275e-01    9.957275e-01    nan
1   2451    f2450   9.957275e-01    2.230954e+00    2.230954e+00
munmap_chunk(): invalid pointer
Cancelled (memory dump written)
$ bin/mrmr example2_small.tsv 
2020-10-08 09:13:21 - No discretization method chosen. Default 'truncate' used...
malloc(): memory corruption
Cancelled (memory dump written)
$ bin/mrmr example2_big.tsv 
2020-10-08 09:13:30 - No discretization method chosen. Default 'truncate' used...
malloc(): memory corruption
Cancelled (memory dump written)

mRMR.zip https://github.com/rlichtenwalter/mRMR/files/5345993/mRMR.zip

rlichtenwalter commented 3 years ago

Hi again,

Thanks for including this. I'll definitely acknowledge that the robustness of the parser and the quality of its error messages are wanting. I had started a branch to improve this, among other things, but never got around to finishing that.

I'll see if I can get you all the answers you need to proceed, and I'll aim to do so within the next few days.

Regards,

Ryan

rlichtenwalter commented 3 years ago

Hi again,

Upon looking at the data sets, I saw the issue right away, but that's not to say that the issue is yours. The code expects unsigned values, but it doesn't complain if it sees signed values and nastily progresses to failure. A long time ago I had been working on a branch that detected this and did a lot of other nice things, but I never finished the work or committed it. It's bad enough that the code doesn't detect the issue. It's even worse that I didn't document this anywhere. I've added a note to the readme about expectations, which is really the least I can do.
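Since the tool itself does not report signed values, one way to spot them beforehand is a small check script. The following is only a sketch in Python under a few assumptions: the file is delimited text with a single header row, and the helper name `find_negative_cells` and the inline sample data are hypothetical, not part of this project.

```python
import csv
import io


def find_negative_cells(lines, delimiter="\t"):
    """Return (line number, column name, raw cell) for every negative data cell.

    The first row is treated as a header and is not checked.
    """
    reader = csv.reader(lines, delimiter=delimiter)
    header = next(reader)
    problems = []
    for line_number, row in enumerate(reader, start=2):
        for name, cell in zip(header, row):
            try:
                value = float(cell)
            except ValueError:
                continue  # non-numeric cell; this sketch does not check those
            if value < 0:
                problems.append((line_number, name, cell))
    return problems


# Hypothetical in-memory sample standing in for a real file.
sample = io.StringIO("Class\tf0\tf1\n1\t-1\t2\n0\t0\t-3\n")
for line_number, name, cell in find_negative_cells(sample):
    print(f"line {line_number}, column {name}: {cell}")
```

For a real file, an open file handle can be passed in place of the `io.StringIO` object, for example `open("example2_small.tsv", newline="")`.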

For you, I suggest doing something along the lines of the following:

< ../example1_small.tsv awk -v 'OFS=\t' -F '\t' 'NR==1{print;}NR>1{for(i=2;i<=NF;++i){$i=$i+1;}print;}'

Assuming you have contiguous integers and it is only that some of them are negative, just add to each an amount that raises the minimum value to zero. Of course, you probably have your own preferred way to do this. In any case, with that fix, you should find that the data sets on which you're operating are really fast to compute.
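The awk command above hard-codes an offset of 1. A slightly more general version of the same idea, sketched in Python, finds the smallest feature value first and shifts by exactly that amount. This is an illustration under assumptions: the data rows are already parsed into integers, column 0 is the class label and is left untouched (as in the awk one-liner), and the name `shift_to_zero` is hypothetical.

```python
def shift_to_zero(rows):
    """Shift all feature columns by one common offset so no value is negative.

    `rows` holds integer data rows without the header; column 0 is the
    class label and is copied through unchanged.
    """
    lowest = min(value for row in rows for value in row[1:])
    # Only shift when there is actually a negative value to get rid of.
    offset = -lowest if lowest < 0 else 0
    return [[row[0]] + [value + offset for value in row[1:]] for row in rows]


# Hypothetical data: class label followed by two feature columns.
data = [[1, -1, 2], [0, 0, -3], [1, 4, 1]]
print(shift_to_zero(data))
```

A single common offset mirrors what the awk command does. Shifting each column by its own minimum would work as well, because mutual information depends only on which values are distinct, not on their magnitudes.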

Please let me know if there are any other questions or issues!

Best,

Ryan

saskra commented 3 years ago

< ../example1_small.tsv awk -v 'OFS=\t' -F '\t' 'NR==1{print;}NR>1{for(i=2;i<=NF;++i){$i=$i+1;}print;}'

Thanks, but this helps only with the first of my four examples. :-(

rlichtenwalter commented 3 years ago

Hi Saskra,

Sorry. After the hiatus, I got into things with work and forgot about this. Do you still need help with the other examples? Sorry for dropping this!

Ryan

saskra commented 3 years ago

I could not solve the problem, but I am also busy with other projects at the moment.