pipeline.encode() failing or hanging infinitely

pohutukawa commented 1 year ago

I have been throwing a log of (very short) data at the encoder to see how it behaves after seeing issues in our project with it. We only have rather short data sequences that need to be encoded, so it was quite easy to run it in a loop and see how it behaves.

In many cases ValueErrors were raised with an empty range exception for randrange(), e.g. empty range for randrange() (10, 8, -2). The error originated in this code bit:

scheme.py, line 424, in _searching_results
    random_index = random.randint(total_count + 3, math.pow(2, index_length) - 1)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Far more seldomly, but still in many cases the code was getting trapped within an infinite loop (or at least one that spun around and did not finish for minutes). This loop has always appeared within YYC._list_to_sequence during a deepcopy() operation.

Any insights on what may be going wrong here? Or how it could be fixed?

HaolingZHANG commented 1 year ago

Dear Dr. Guy Kloss,

This is a good question, involving the parameter optimization of this method.

I would like to present the entire procedure for the function where a ValueError occurred, i.e., "_searching_results". In scheme.py, between Lines 388 and 404, we aim to find a paired bit segment from candidates "other_lists" that meets the constraints after a single encoding process based on a given bit segment "fixed_list" within a specified limit of candidate search "self.search_count". However, there may be cases where we are unable to find a suitable paired bit segment within a limited number of searches. In such cases, as illustrated in Lines 405 to 435 of scheme.py, we resort to generating a random bit segment to complete this search. For further information on the design and simulation details, please refer to our paper.

At Line 424 in scheme.py, the objective is to create an index region for a randomly generated bit segment. There is a requirement that this generated bit segment should not impact the data recovery process (refer to Lines 132-181 in utils.index_operator.py). Specifically, the index of this generated bit segment should be at least two greater than the maximum index of all other bit segments. This is because the loss of two consecutive sequences is unlikely. Under this requirement, the range of "random.randint" may cause ValueError because the inputted minimum value (total_count + 3) is greater than or equal to the inputted maximum value (2^l, where l is the length of index region).

There are three possible solutions to this issue (or the rapid encoding issues):

During initialization, the length of the index region is set instead of being automatically generated based on the number of bit segments. We apologize for not providing this parameter in the initialization function of this code repository, so you need to directly modify Line 307 in scheme.py. However, we highly recommend utilizing our software, Chamaeleo, which combines conventional coding schemes (including Yin-Yang Code) with more algorithm evaluation systems.
Rapid encoding can be achieved by adjusting the combination of yin-yang rules. Different combinations of these rules are suitable for different files or file patterns. Analysis of these combinations can be found in Figures S3, S4, and Table S4 in the supplementary information of our paper.
In some cases, overly strict constraints may also lead to longer encoding time.

If you have any further questions about this, please feel free to let me know.

Cheers, Haoling

pohutukawa commented 1 year ago

Hi Haoling,

thanks for the quick response, and apologies from my late reply on this (I was off for a week). I had started to read your paper 'Towards practical and robust DNA-based data archiving using the yin-yang codec system' already. I was in parallel trying to get a 'feel' for it by using your implementation. Also, as I was not interested in file-based data encoding, I had modified your implementation in a personal fork to support generic streams for data i/o, allowing me for an easier experimentation in memory:

https://github.com/pohutukawa/DNA-storage-YYC/tree/feature/generic-streams

I have just had a look at Chamaeleo, which seems much more capable and universal with different encoders. Very interesting. But unfortunately the GitHub code in the master branch does not seem to (fully) include the yin-yang codec, and the description provided in tutorial.rst cannot be reproduced (also the codec_factory is missing and thus not importable). I'd suppose there may be some things still missing in merging from other (private) branches.

I will continue reading your paper, and see if I can make do with this code base. But I'd be nonetheless be interested in Chamaeleo (if it was to get fixed soon). I can possibly (again) also contribute generic stream handling to it, too.

Cheers,

Guy

pohutukawa commented 1 year ago

I have found the specific piece in Chamaelo that allows me to exercise the YinYangCodec. I have made a personal feature branch that implements parameters input_stream and output_stream to pipeline.transcode() (together with an example file):

https://github.com/pohutukawa/Chamaeleo/tree/feature/generic-streams

Now I'll see how this works for the different payloads (as indicated above).

pohutukawa commented 1 year ago

Just trying it with Chamaeleo, I can see that the issue seems to be the same. So I cannot create an encoding when the data does not meet the minimum requirement given by the segment_size. And if I have quite regularly structured data (e.g. multiples of the byte sequence b"foo bar baz ", then the code runs into issues with quickly requiring excessive time. Whereas if I use high entropy data (e.g. random byte sequences), the transcoder works quite efficiently.

I suppose these things just need to be known for usage (if the above assumptions are correct). So one may not want to (blindly) apply the transcoder to very structured files with low entropy. It may be worth documenting that (both in this repo as well as in Chamaelio).

Though, I would be keen to find out if there are ways to mitigate this.

Also, further discussion should also likely then be conducted on the Chamaelio repo :-)

HaolingZHANG commented 1 year ago

Hi Guy,

A rapid solution here: the measure for this scenario is to compress digital data in advance. DNA-fountain (Erlich et al. Science, 2017) adopted this measure. Compressing digital data can better balance the data pattern at the byte-frequency level. Thus, multiple repetitions of a certain binary message are less likely to occur.

It may require refactoring the code instead of simple renovations to achieve your potential objective (maybe hard work), so the following contents are some comments for you.

As far as I'm concerned, it seems that your objective is to modify YinYang Code from a file-based coding system to a stream-based coding system. From a computer standpoint, it is a commendable optimization objective as it could pave the way for real-time information writing. Nevertheless, the sluggish pace of DNA synthesis stands as a major obstacle to realizing real-time processing. In contrast, the time needed to convert digital data into DNA strings is relatively minor. This is why many well-established coding systems treat a file as a data package during the encoding process.

There are generally two categories of methods: constrained coding scheme and screen-based coding scheme (e.g., DNA-fountain and YinYang Code). The latter may have excessive time consumption issues without parameter tuning, as you mentioned above.

If you want to create a stream-based screen-based coding scheme, you may need a queue to store continuously incoming bit/byte streams. There are some programming suggestions.

After receiving a new binary message, the pipeline could be several for-if statements. That is, encoding the newly received binary message and a cached binary message in the queue by YinYang Code and checking the produced DNA string. If the produced DNA string is valid, two binary messages and this DNA sequence can be outputted. Otherwise, this new binary message will be cached at the end of the queue.
When the waiting time of a cached binary message exceeds the expectation (e.g., 1 minute), it will enter the random matching phase. It may be considered to open a new thread to complete this task. There are two key points that may need to be considered: [1] Does this binary message require index information to determine its location in the entire stream? If not needed, consider directly randomly generating an appearance binary message. [2] In Lines 497 – 532 in scheme.py, the default process is to add one random bit at a time and determine whether the current DNA string in the current situation is valid. Essentially, this is a regional constraint that is more stringent than the global constraint, significantly increasing the failure rate of random matching. You can consider checking the validity of the generated DNA string after the encoding process is completed.
You can also follow some of the suggestions I gave earlier, and as you gain more understanding of the field of DNA-based data storage, I believe they will also be useful for your goals.

If you have any further questions about this, please feel free to let me know

Cheers, Haoling

HaolingZHANG commented 1 year ago

By the way, if you are interested in collaborating with us to promote the adaptation of YinYang Code on the stream-based coding scheme, please feel free to contact Dr. Zhi Ping and cc me (pingzhi[at]genomics.cn, zhanghaoling[at]genomics.cn).

Haoling

pohutukawa commented 1 year ago

Kia ora Haoliing. I've sent an email.

ntpz870817 / DNA-storage-YYC

pipeline.encode() failing or hanging infinitely #3