stamatak / ExaML

Exascale Maximum Likelihood (ExaML) code for phylogenetic inference using MPI

Large number of sites overflows in parser #8

Closed jasondk closed 8 years ago

jasondk commented 9 years ago

In axml.h, the rawdata->sites variable is defined as type int. Attempting to compress an alignment with about 3B positions results in a "too few sites" error, presumably because the site count overflows the int. We will also have more than 32k site patterns after compression, and some of these will occur more than 32k times in the dataset, so the site/alias indexes will still overflow even after changing sites to long long int. Is there a quick fix for this? Thanks!
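For illustration, here is a minimal sketch of the overflow described above. It is not ExaML's parser code: `parse_site_count` and `fits_in_int` are hypothetical helpers, and the sketch assumes a 32-bit `int`, as on common 64-bit Linux platforms. A site count near 3 billion exceeds `INT_MAX` (2,147,483,647), so it has to be read into a 64-bit type:

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper (not part of ExaML): parse a site count from text
   into a 64-bit integer. Returns 0 on success, -1 on malformed,
   negative, or out-of-range input. */
static int parse_site_count(const char *text, int64_t *sites)
{
  char *end = NULL;
  long long value;

  assert(text != NULL && sites != NULL);

  errno = 0;
  value = strtoll(text, &end, 10);

  /* Reject overflow, empty input, trailing garbage, and negative counts. */
  if (errno == ERANGE || end == text || *end != '\0' || value < 0)
    return -1;

  *sites = (int64_t)value;
  return 0;
}

/* Returns 1 if the count could be stored in a plain int without overflow. */
static int fits_in_int(int64_t sites)
{
  return sites <= (int64_t)INT_MAX;
}
```

Calling `parse_site_count("3000000000", &sites)` succeeds, while `fits_in_int(sites)` then reports 0, which is the situation an `int`-typed site counter cannot represent.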

stamatak commented 9 years ago

I'll try to fix this soon. I don't think the fix is easy if you don't know the ExaML code well. In the future, please use the RAxML Google group for reporting bugs, so that all users are aware of potential problems.

Alexis


Alexandros (Alexis) Stamatakis

Research Group Leader, Heidelberg Institute for Theoretical Studies Full Professor, Dept. of Informatics, Karlsruhe Institute of Technology Adjunct Professor, Dept. of Ecology and Evolutionary Biology, University of Arizona at Tucson

www.exelixis-lab.org

stamatak commented 9 years ago

Dear Jason,

I think that I have fixed it but I need access to the dataset for testing.

Cheers,

Alexis


jasondk commented 9 years ago

Hey Alexis,

Thanks so much and sorry for the delay in responding. You can download the compressed dataset (5GB, sorry!) here http://hyperion.ucalgary.ca/example.phy.bz2. I’ll leave the link up for a couple of days. If you have a problem downloading it, you could just simulate a similar dataset. The dimensions are 7 OTUs and 3,036,303,846 sites with very little divergence (most of this will compress out if indexing site patterns).
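The site-pattern compression mentioned above can be sketched as follows. This is a toy, sort-based version written for illustration, not the method ExaML's parser uses: `pattern_t`, `compress_patterns`, and the `MAX_TAXA` limit are assumptions of the sketch. Identical alignment columns collapse into one pattern, and each pattern carries a 64-bit weight so that a column occurring billions of times is still counted safely:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define MAX_TAXA 16 /* demo limit, an assumption of this sketch */

/* One unique alignment column and the number of sites that share it. */
typedef struct {
  char    column[MAX_TAXA + 1]; /* NUL-terminated, one character per taxon */
  int64_t weight;               /* 64-bit so counts above INT_MAX are safe */
} pattern_t;

static int compare_columns(const void *a, const void *b)
{
  return strcmp((const char *)a, (const char *)b);
}

/* Hypothetical helper: compress the columns of an alignment given as
   `ntaxa` row strings of length `nsites`. Returns a malloc'ed array of
   unique patterns (sized for the worst case of nsites entries) and stores
   the number of patterns in *npatterns. Returns NULL if allocation fails. */
static pattern_t *compress_patterns(const char *const *rows, size_t ntaxa,
                                    size_t nsites, size_t *npatterns)
{
  char (*columns)[MAX_TAXA + 1];
  pattern_t *patterns;
  size_t i, t, n = 0;

  assert(rows != NULL && npatterns != NULL);
  assert(ntaxa > 0 && ntaxa <= MAX_TAXA && nsites > 0);

  columns  = malloc(nsites * sizeof *columns);
  patterns = malloc(nsites * sizeof *patterns);
  if (columns == NULL || patterns == NULL) {
    free(columns);
    free(patterns);
    return NULL;
  }

  /* Transpose: turn every alignment column into its own string. */
  for (i = 0; i < nsites; i++) {
    for (t = 0; t < ntaxa; t++)
      columns[i][t] = rows[t][i];
    columns[i][ntaxa] = '\0';
  }

  /* Sorting places identical columns next to each other. */
  qsort(columns, nsites, sizeof *columns, compare_columns);

  /* Each run of identical columns becomes one weighted pattern. */
  for (i = 0; i < nsites; i++) {
    if (n > 0 && strcmp(patterns[n - 1].column, columns[i]) == 0) {
      patterns[n - 1].weight++;
    } else {
      strcpy(patterns[n].column, columns[i]);
      patterns[n].weight = 1;
      n++;
    }
  }

  free(columns);
  *npatterns = n;
  return patterns;
}
```

For a three-taxon alignment with rows `AACA`, `AACA`, `AAGA`, the four columns reduce to two patterns: `AAA` with weight 3 and `CCG` with weight 1. The caller frees the returned array with `free`.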

Best wishes,

A.P. Jason de Koning, Ph.D.

Assistant Professor University of Calgary, Faculty of Medicine and Alberta Children's Hospital Research Institute for Child and Maternal Health Dept. of Biochemistry and Molecular Biology Dept. of Medical Genetics

Health Sciences Centre 1150 Suite 3330 Hospital Drive N.W. Calgary, Alberta T2N 4N1 Canada

Office: 403-210-7638 | Fax: 403-270-8928 Email: jason.dekoning@ucalgary.ca Web: http://lab.jasondk.io


stamatak commented 9 years ago

Hi Jason,

The modified parser works now. How quickly do you need the fix?

I am in the middle of a larger re-design, so the code with the fixed parser is not ready for release yet.

Below is the output of the parser. Does that look right? It looks rather weird to me.

Alexis

Pattern compression: ON

Alignment has 200630281 completely undetermined sites that will be automatically removed from the binary alignment file

Your alignment has 5956 unique patterns

Under CAT the memory required by ExaML for storing CLVs and tip vectors will be 1375836 bytes 1343 kiloBytes 1 MegaBytes 0 GigaBytes

Under GAMMA the memory required by ExaML for storing CLVs and tip vectors will be 5378268 bytes 5252 kiloBytes 5 MegaBytes 0 GigaBytes

Please note that, these are just the memory requirements for doing likelihood calculations! To be on the safe side, we recommend that you execute ExaML on a system with twice that memory.

Binary and compressed alignment file written to file HUGE.binary

Parsing completed, exiting now ...
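The byte counts in this output are consistent with a simple estimate, sketched below. The formula is inferred from the reported numbers and from the 7 taxa of this dataset; it is not taken from the ExaML source, and `estimate_bytes` is a hypothetical name. It assumes one byte per taxon and pattern for tip vectors, plus one 8-byte double per taxon, pattern, state (4 for DNA) and rate category (1 under CAT, 4 under GAMMA) for the CLVs:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed memory estimate, reverse-engineered from the parser output
   above (not from the ExaML source). All counts are 64-bit so the
   product cannot overflow for large pattern counts. */
static int64_t estimate_bytes(int64_t ntaxa, int64_t npatterns,
                              int64_t nstates, int64_t nrates)
{
  int64_t tips, clvs;

  assert(ntaxa > 0 && npatterns > 0 && nstates > 0 && nrates > 0);

  /* Tip vectors: one byte per taxon and pattern. */
  tips = ntaxa * npatterns;

  /* CLVs: one double per taxon, pattern, state and rate category. */
  clvs = ntaxa * npatterns * nstates * nrates * (int64_t)sizeof(double);

  return tips + clvs;
}
```

With 7 taxa, 5956 patterns and 4 states, and assuming an 8-byte `double`, `estimate_bytes(7, 5956, 4, 1)` gives 1375836 and `estimate_bytes(7, 5956, 4, 4)` gives 5378268, matching the CAT and GAMMA lines.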


jasondk commented 9 years ago

Hey Alexis, this looks approximately correct to me. We’d previously run just the variable sites from this dataset and had similar results. Can you possibly make the binary output of the parser for this dataset available to us for download? Or allow us access to the revised parser? This is for the last piece of a student project that is otherwise complete. Thanks! Jason


stamatak commented 9 years ago

Just sent the code to your university email.

alexis
